Data Explanation

The dataset is extracted from a larger database of 76 variables; only 15 of them are used here. The target field indicates the presence of heart disease in a patient and is integer-valued from 0 (no presence) to 4. However, experiments with the database have concentrated on simply distinguishing presence (values 1, 2, 3, 4) from absence (value 0). Accordingly, this dataset is used to compare the efficiency of different classification algorithms. The label is the "target" variable, which contains only two values: 0 and 1.

15 attributes used:

Input variables:

  1. patientid: ID of the patient
  2. age: age in years
  3. sex: sex (1 = male; 0 = female)
  4. cp: chest pain type -- Value 1: typical angina -- Value 2: atypical angina -- Value 3: non-anginal pain -- Value 4: asymptomatic
  5. trestbps: resting blood pressure (in mm Hg on admission to the hospital)
  6. chol: serum cholesterol in mg/dl
  7. fbs: (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
  8. restecg: resting electrocardiographic results -- Value 0: normal -- Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV) -- Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
  9. thalach: maximum heart rate achieved
  10. exang: exercise induced angina (1 = yes; 0 = no)
  11. oldpeak: ST depression induced by exercise relative to rest
  12. slope: the slope of the peak exercise ST segment -- Value 1: upsloping -- Value 2: flat -- Value 3: downsloping
  13. ca: number of major vessels (0-3) colored by fluoroscopy
  14. thal: 3 = normal; 6 = fixed defect; 7 = reversible defect

Predicted Variable:

  1. target: diagnosis of heart disease (angiographic disease status) -- Value 0: < 50% diameter narrowing -- Value 1: > 50% diameter narrowing (in any major vessel: attributes 59 through 68 in the full dataset are vessels)
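As a sketch of that binarization (the values below are illustrative; in this csv the "target" column is already binary), the original 0-4 disease-status field collapses to presence/absence like so:

```python
import pandas as pd

# Hypothetical illustration: collapsing the original 0-4 disease-status
# field into the binary "target" used here (0 = absence, 1 = presence).
num = pd.Series([0, 1, 2, 3, 4, 0])
target = (num > 0).astype(int)
print(target.tolist())  # [0, 1, 1, 1, 1, 0]
```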

Data Preparation

In [27]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

from sklearn.model_selection import train_test_split 
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

import time
In [23]:
#Load the data

heart = pd.read_csv("D:/Business Analytics Program/Courses/Applied Machine Learning/datasets/heart.csv")

heart.head(5)
Out[23]:
patientid age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal target
0 1 63 1 3 145 233 1 0 150 0 2.3 0 0 1 1
1 2 37 1 2 130 250 0 1 187 0 3.5 0 0 2 1
2 3 41 0 1 130 204 0 0 172 0 1.4 2 0 2 1
3 4 56 1 1 120 236 0 1 178 0 0.8 2 0 2 1
4 5 57 0 0 120 354 0 1 163 1 0.6 2 0 2 1
In [24]:
#Review the data

#heart = heart.drop(['patientid','fbs','restecg','exang','ca','slope','sex','cp','thal'],axis =1)
heart = heart.drop(['patientid'],axis =1)
print(heart.dtypes)
rows, columns = heart.shape
print("Rows:", rows)
print("Columns:", columns)
age           int64
sex           int64
cp            int64
trestbps      int64
chol          int64
fbs           int64
restecg       int64
thalach       int64
exang         int64
oldpeak     float64
slope         int64
ca            int64
thal          int64
target        int64
dtype: object
Rows: 303
Columns: 14

In total, the dataset includes 303 patients (data points). The "patientid" column is dropped here because it carries no predictive information.

In [25]:
#check if any variables containing null values

heart.isnull().any() 
Out[25]:
age         False
sex         False
cp          False
trestbps    False
chol        False
fbs         False
restecg     False
thalach     False
exang       False
oldpeak     False
slope       False
ca          False
thal        False
target      False
dtype: bool

We can see that no attribute contains null values.

In [21]:
#Use pairs plot illustrate the data

sns.pairplot(heart.iloc[:,7:15])
Out[21]:
<seaborn.axisgrid.PairGrid at 0x2359523e5c8>

Here, the pairs plot shows the relationship between each pair of the variables "thalach", "exang", "oldpeak", "slope", "ca", "thal", and "target".

As the plots in the last row suggest, the "target" variable clearly separates into two classes.
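A quick way to confirm the two-class split numerically is a value count on "target"; the series below is a synthetic stand-in for the actual column:

```python
import pandas as pd

# Sketch of a class-balance check; replace this synthetic series
# with heart["target"] when the dataset is loaded.
target = pd.Series([1, 0, 1, 1, 0, 1, 0, 1])
counts = target.value_counts().sort_index()
print(counts.to_dict())  # {0: 3, 1: 5}
```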

In [26]:
#Split the data set in predictors and predicted

feature_cols = ['age','sex','cp','trestbps','chol','fbs','restecg','thalach','exang','oldpeak','slope','ca','thal']

X = heart[feature_cols] # Features
y = heart.target # Target variable

To evaluate the consistency of the results and build learning curves for each algorithm, the test size is varied from 0.05 to 0.95 (a test size cannot be 0 or 1) in steps of 0.05, giving 19 test sizes in total. Each algorithm is run three times, once for each random state from 0 to 2.
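One way to generate those 19 test sizes without the floating-point drift of repeated `j += 0.05` additions (np.linspace here is an alternative sketch, not the loop the notebook actually uses):

```python
import numpy as np

# 19 evenly spaced test sizes: 0.05, 0.10, ..., 0.95.
# linspace fixes both endpoints exactly, avoiding accumulation error.
test_sizes = np.round(np.linspace(0.05, 0.95, 19), 2)
print(len(test_sizes))                 # 19
print(test_sizes[0], test_sizes[-1])   # 0.05 0.95
```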

Logistic Regression

Important parameters for Logistic Regression in scikit learn:

  • penalty: {‘l1’, ‘l2’, ‘elasticnet’, ‘none’}, default=’l2’

  • solver: {‘newton-cg’, ‘lbfgs’, ‘liblinear’, ‘sag’, ‘saga’}, default=’lbfgs’

  • multi_class: {‘auto’, ‘ovr’, ‘multinomial’}, default=’auto’

The dataset is binary, so the multi_class parameter will be set to "ovr" (one-vs-rest); if "auto" is chosen, "ovr" is selected automatically.

The 'lbfgs' and 'newton-cg' solvers support the l2 or none penalty, which specifies the norm used in the penalization; the 'sag' solver likewise supports only l2 or none.

The 'saga' solver supports every kind of penalty. With penalty = 'elasticnet' and an l1_ratio between 0 and 1, the norm becomes a combination of l1 and l2.

The 'liblinear' solver can handle the l1 penalty, but it does not support penalty = 'none'.
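A minimal sketch of the solver/penalty compatibility described above, encoded as a hand-written lookup table (transcribed from the text, not queried from scikit-learn itself):

```python
# Hand-written summary of which penalties each solver accepts,
# per the discussion above; an assumption, not the library's own API.
SUPPORTED = {
    "lbfgs":     {"l2", "none"},
    "newton-cg": {"l2", "none"},
    "sag":       {"l2", "none"},
    "saga":      {"l1", "l2", "elasticnet", "none"},
    "liblinear": {"l1", "l2"},
}

def valid_combo(solver, penalty):
    """Return True if the solver/penalty pair is listed as supported."""
    return penalty in SUPPORTED[solver]

print(valid_combo("liblinear", "l1"))     # True
print(valid_combo("lbfgs", "l1"))         # False
print(valid_combo("saga", "elasticnet"))  # True
```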

In [218]:
# Approach 1: solver = lbfgs (default)

lbfgs_train_accuracy = np.zeros([3, 19])
lbfgs_test_accuracy = np.zeros([3, 19])
lbfgs_time = np.zeros([3, 19])
lbfgs_size = np.zeros([3, 19])
j = 0.05
k = 0
for i in range(3):                       # three runs, random_state = 0, 1, 2
    while j < 1:                         # test sizes 0.05, 0.10, ..., 0.95
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=j, random_state=i)

        Start_time = time.time()         # save the current time

        # Note: the default max_iter=100 triggers the ConvergenceWarning
        # below; raising max_iter or scaling the features removes it.
        LogisticReg = LogisticRegression(penalty='l2', solver='lbfgs')
        LogisticReg.fit(X_train, y_train)

        y_pred_Train = LogisticReg.predict(X_train)
        y_pred_Test = LogisticReg.predict(X_test)

        End_time = time.time()           # save the current time

        lbfgs_size[i, k] = j             # store the test size for this run

        lbfgs_train_accuracy[i, k] = round(metrics.accuracy_score(y_train, y_pred_Train), 4)
        lbfgs_test_accuracy[i, k] = round(metrics.accuracy_score(y_test, y_pred_Test), 4)
        lbfgs_time[i, k] = round(End_time - Start_time, 6)

        j += 0.05
        k += 1

    j = 0.05
    k = 0
D:\Anaconda\lib\site-packages\sklearn\linear_model\_logistic.py:940: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)
D:\Anaconda\lib\site-packages\sklearn\linear_model\_logistic.py:940: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)
D:\Anaconda\lib\site-packages\sklearn\linear_model\_logistic.py:940: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)
D:\Anaconda\lib\site-packages\sklearn\linear_model\_logistic.py:940: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)
D:\Anaconda\lib\site-packages\sklearn\linear_model\_logistic.py:940: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)
D:\Anaconda\lib\site-packages\sklearn\linear_model\_logistic.py:940: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)
D:\Anaconda\lib\site-packages\sklearn\linear_model\_logistic.py:940: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)
D:\Anaconda\lib\site-packages\sklearn\linear_model\_logistic.py:940: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)
D:\Anaconda\lib\site-packages\sklearn\linear_model\_logistic.py:940: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)
D:\Anaconda\lib\site-packages\sklearn\linear_model\_logistic.py:940: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)
D:\Anaconda\lib\site-packages\sklearn\linear_model\_logistic.py:940: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)
D:\Anaconda\lib\site-packages\sklearn\linear_model\_logistic.py:940: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)
D:\Anaconda\lib\site-packages\sklearn\linear_model\_logistic.py:940: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)
D:\Anaconda\lib\site-packages\sklearn\linear_model\_logistic.py:940: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)
D:\Anaconda\lib\site-packages\sklearn\linear_model\_logistic.py:940: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)
D:\Anaconda\lib\site-packages\sklearn\linear_model\_logistic.py:940: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)
D:\Anaconda\lib\site-packages\sklearn\linear_model\_logistic.py:940: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)
D:\Anaconda\lib\site-packages\sklearn\linear_model\_logistic.py:940: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)
D:\Anaconda\lib\site-packages\sklearn\linear_model\_logistic.py:940: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)
D:\Anaconda\lib\site-packages\sklearn\linear_model\_logistic.py:940: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)
D:\Anaconda\lib\site-packages\sklearn\linear_model\_logistic.py:940: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)
D:\Anaconda\lib\site-packages\sklearn\linear_model\_logistic.py:940: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)
D:\Anaconda\lib\site-packages\sklearn\linear_model\_logistic.py:940: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)
D:\Anaconda\lib\site-packages\sklearn\linear_model\_logistic.py:940: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)
D:\Anaconda\lib\site-packages\sklearn\linear_model\_logistic.py:940: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)
D:\Anaconda\lib\site-packages\sklearn\linear_model\_logistic.py:940: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)
D:\Anaconda\lib\site-packages\sklearn\linear_model\_logistic.py:940: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)
D:\Anaconda\lib\site-packages\sklearn\linear_model\_logistic.py:940: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)
D:\Anaconda\lib\site-packages\sklearn\linear_model\_logistic.py:940: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)
D:\Anaconda\lib\site-packages\sklearn\linear_model\_logistic.py:940: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)
D:\Anaconda\lib\site-packages\sklearn\linear_model\_logistic.py:940: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)
D:\Anaconda\lib\site-packages\sklearn\linear_model\_logistic.py:940: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)
D:\Anaconda\lib\site-packages\sklearn\linear_model\_logistic.py:940: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)
D:\Anaconda\lib\site-packages\sklearn\linear_model\_logistic.py:940: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)
D:\Anaconda\lib\site-packages\sklearn\linear_model\_logistic.py:940: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)
D:\Anaconda\lib\site-packages\sklearn\linear_model\_logistic.py:940: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)
D:\Anaconda\lib\site-packages\sklearn\linear_model\_logistic.py:940: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)
D:\Anaconda\lib\site-packages\sklearn\linear_model\_logistic.py:940: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)
D:\Anaconda\lib\site-packages\sklearn\linear_model\_logistic.py:940: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)
D:\Anaconda\lib\site-packages\sklearn\linear_model\_logistic.py:940: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)
D:\Anaconda\lib\site-packages\sklearn\linear_model\_logistic.py:940: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)
D:\Anaconda\lib\site-packages\sklearn\linear_model\_logistic.py:940: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)
D:\Anaconda\lib\site-packages\sklearn\linear_model\_logistic.py:940: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)
D:\Anaconda\lib\site-packages\sklearn\linear_model\_logistic.py:940: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)
D:\Anaconda\lib\site-packages\sklearn\linear_model\_logistic.py:940: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)
D:\Anaconda\lib\site-packages\sklearn\linear_model\_logistic.py:940: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)

In the output, we see that this model runs into a convergence problem: the solver hits its iteration limit. One suggested fix is scaling the data, so I will apply that method to the model.
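For reference, the warning's other suggestion can also be applied directly: raising max_iter gives lbfgs more iterations to converge. A minimal sketch, using synthetic data from make_classification as a stand-in for the heart dataset:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_demo, y_demo = make_classification(n_samples=300, n_features=14, random_state=0)

# max_iter defaults to 100; raising it gives the lbfgs solver room to converge
model = LogisticRegression(solver='lbfgs', max_iter=2000).fit(X_demo, y_demo)
print(model.n_iter_)    # iterations actually used
```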

The other solver options, 'newton-cg', 'sag' and 'saga', face the same problem and also demand scaling the data before running the algorithm. Therefore, I use X_train_scaled instead of the original X_train with these solvers.

The 'liblinear' solver is the only one that does not face this problem.
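A note on the scaling itself: preprocessing.scale standardizes each array independently. An alternative, sketched below on purely illustrative toy arrays, is to fit a StandardScaler on the training set and reuse its statistics on the test set, so both sets are transformed consistently:

```python
import numpy as np

from sklearn.preprocessing import StandardScaler

# Toy feature matrix; the values are illustrative only
X_train_demo = np.array([[55.0, 240.0], [61.0, 210.0], [47.0, 300.0]])
X_test_demo = np.array([[58.0, 260.0]])

scaler = StandardScaler().fit(X_train_demo)   # learn mean/std from training data only
X_train_std = scaler.transform(X_train_demo)
X_test_std = scaler.transform(X_test_demo)    # reuse the same statistics on the test set
print(X_train_std.mean(axis=0))
```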

In [137]:
from sklearn import preprocessing
In [219]:
# Approach 1: solver = lbfgs (default)

lbfgs_train_accuracy = np.zeros([3,19])
lbfgs_test_accuracy = np.zeros([3,19])
lbfgs_time = np.zeros([3,19])
lbfgs_size = np.zeros([3,19])
j = 0.05
k = 0
for i in range (3):
    while j <1:
        X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=j,random_state=i)
        X_train_scaled = preprocessing.scale(X_train) # scale the dataset
        X_test_scaled = preprocessing.scale(X_test)
        
        Start_time = time.time() #Saving current time
        
        LogisticReg = LogisticRegression(penalty = 'l2',solver = 'lbfgs')
        LogisticReg.fit(X_train_scaled,y_train)

        y_pred_Train = LogisticReg.predict(X_train_scaled)
        y_pred_Test = LogisticReg.predict(X_test_scaled) 
    
        End_time = time.time() #Saving current time
    
        lbfgs_size[i,k] = j
    
        lb1 = float(round(metrics.accuracy_score(y_train,y_pred_Train),4))
        lbfgs_train_accuracy[i,k] = lb1

        lb2 = float(round(metrics.accuracy_score(y_test, y_pred_Test),4))
        lbfgs_test_accuracy[i,k] = lb2

        lb3 = float(round(End_time - Start_time,6))
        lbfgs_time[i,k] = lb3
    
        j+= 0.05
        k+= 1
    
    j = 0.05
    k = 0
In [183]:
# plot the results:
x = lbfgs_size[0,]

fig, ax = plt.subplots(ncols=3, figsize=(15,4))

for i in range(3):
    ax[i].plot(x,lbfgs_train_accuracy[i,],label='Training set accuracy', color = 'navy', marker='*',markersize=8)
    ax[i].plot(x,lbfgs_test_accuracy[i,],label='Testing set accuracy', color = 'green', marker='*',markersize=8)
    ax[i].grid()
    ax[i].set_xlabel('Testing size')
    ax[i].set_ylabel('Accuracy')
    ax[i].set_title('Lbfgs solver Model_random state_%i' % i)
    ax[i].legend()
fig.tight_layout(pad = 0.5)
plt.show()

The training set accuracy is high in all the random states, often above 85%, and it tends to increase with the testing size.

The testing set accuracy fluctuates more in random_state 0 and 1; in these two states, it mostly ranges between 75% and 85%. In random state 2, testing accuracy gradually falls from a peak of nearly 95%.

In [184]:
# plot the results:

fig, ax = plt.subplots(ncols=3, figsize=(15,4))

for i in range(3):
    ax[i].plot(x,lbfgs_time[i,],label='Running time', color = 'red', marker='*',markersize=8)
    ax[i].grid()
    ax[i].set_xlabel('Testing size')
    ax[i].set_ylabel('Running time')
    ax[i].set_title('Lbfgs solver Model_random state_%i' % i)
    ax[i].legend()
fig.tight_layout(pad = 0.5)
plt.show()

The running time of the algorithms varies considerably when the testing size changes in all the random states.

Overall, as the testing size increases, the training set accuracy rises but the testing set accuracy tends to decrease. It is best to choose the testing size at which the testing set accuracy begins to decrease, that is, where the model starts overfitting. The running time can also be considered when different testing sizes yield similar accuracy values.

Inferred from the graphs, in random_state_0: it is best to choose testing size 20%. The figures for random_state_1 and random_state_2 are 10% and 5% respectively.
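Since the best testing size differs between random states, one way to reduce this sensitivity is to average accuracy over several splits with k-fold cross-validation rather than reading a single split per point. A minimal sketch, again using synthetic data in place of the heart dataset:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(n_samples=300, n_features=14, random_state=0)

# Putting the scaler in a pipeline refits it on each training fold
model = make_pipeline(StandardScaler(), LogisticRegression(solver='lbfgs'))
scores = cross_val_score(model, X_demo, y_demo, cv=5)   # one accuracy per fold
print(scores.mean())
```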

In [220]:
# Approach 2: solver = sag

sag_train_accuracy = np.zeros([3,19])
sag_test_accuracy = np.zeros([3,19])
sag_time = np.zeros([3,19])
sag_size = np.zeros([3,19])
j = 0.05
k = 0
for i in range (3):
    while j <1:
        X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=j,random_state=i)
        X_train_scaled = preprocessing.scale(X_train) # scale the dataset
        X_test_scaled = preprocessing.scale(X_test)
        
        Start_time = time.time() #Saving current time
        
        LogisticReg = LogisticRegression(penalty = 'l2',solver = 'sag')
        LogisticReg.fit(X_train_scaled,y_train)

        y_pred_Train = LogisticReg.predict(X_train_scaled)
        y_pred_Test = LogisticReg.predict(X_test_scaled) 
    
        End_time = time.time() #Saving current time
    
        sag_size[i,k] = j
    
        sag1 = float(round(metrics.accuracy_score(y_train,y_pred_Train),4))
        sag_train_accuracy[i,k] = sag1

        sag2 = float(round(metrics.accuracy_score(y_test, y_pred_Test),4))
        sag_test_accuracy[i,k] = sag2

        sag3 = float(round(End_time - Start_time,6))
        sag_time[i,k] = sag3
    
        j+= 0.05
        k+= 1
    
    j = 0.05
    k = 0
In [186]:
# plot the results:
x = sag_size[0,]

fig, ax = plt.subplots(ncols=3, figsize=(15,4))

for i in range(3):
    ax[i].plot(x,sag_train_accuracy[i,],label='Training set accuracy', color = 'navy', marker='*',markersize=8)
    ax[i].plot(x,sag_test_accuracy[i,],label='Testing set accuracy', color = 'green', marker='*',markersize=8)
    ax[i].grid()
    ax[i].set_xlabel('Testing size')
    ax[i].set_ylabel('Accuracy')
    ax[i].set_title('sag solver Model_random state_%i' % i)
    ax[i].legend()
fig.tight_layout(pad = 0.5)
plt.show()

The results obtained with 'sag' are very similar to those obtained with 'lbfgs'.

The training set accuracy is also high in all the random states, often above 85%, and it tends to increase.

The testing set accuracy also fluctuates more in random_state 0 and 1; in these two states, it mostly ranges between 75% and 85%.

In [187]:
# plot the results:

fig, ax = plt.subplots(ncols=3, figsize=(15,4))

for i in range(3):
    ax[i].plot(x,sag_time[i,],label='Running time', color = 'red', marker='*',markersize=8)
    ax[i].grid()
    ax[i].set_xlabel('Testing size')
    ax[i].set_ylabel('Running time')
    ax[i].set_title('Sag solver Model_random state_%i' % i)
    ax[i].legend()
fig.tight_layout(pad = 0.5)
plt.show()

However, the running time differs markedly across the random states as the testing size changes.

The best testing size for each state is similar to the numbers chosen for 'lbfgs'.

In random_state_0: it is best to choose testing size 20%. The figures for random_state_1 and random_state_2 are 10% and 5% respectively.

Next, we will talk about the 'saga' solver. With 'saga', we can choose the penalty parameter 'elasticnet' and set l1_ratio between 0 and 1: l1_ratio = 0 makes 'elasticnet' equivalent to the 'l2' penalty; l1_ratio = 1 makes it equivalent to the 'l1' penalty; 0 < l1_ratio < 1 gives a combination of the 'l1' and 'l2' penalties. Here we can find the best value of l1_ratio in three different random states.

In [221]:
# Approach 3: solver = saga
# penalty = elasticnet
# find the best l1_ratio
# choose the common testing size = 0.2

saga_train_accuracy = np.zeros([3,20])
saga_test_accuracy = np.zeros([3,20])
saga_time = np.zeros([3,20])
saga_ratio = np.zeros([3,20])
j = 0
k = 0

for i in range (3):
    while j <=1.0:
        X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=i)
        X_train_scaled = preprocessing.scale(X_train) # scale the dataset
        X_test_scaled = preprocessing.scale(X_test)
        
        Start_time = time.time() #Saving current time
        
        LogisticReg = LogisticRegression(penalty = 'elasticnet',solver = 'saga',l1_ratio = j)
        LogisticReg.fit(X_train_scaled,y_train)

        y_pred_Train = LogisticReg.predict(X_train_scaled)
        y_pred_Test = LogisticReg.predict(X_test_scaled) 
    
        End_time = time.time() #Saving current time
    
        saga_ratio[i,k] = j
    
        saga1 = float(round(metrics.accuracy_score(y_train,y_pred_Train),4))
        saga_train_accuracy[i,k] = saga1

        saga2 = float(round(metrics.accuracy_score(y_test, y_pred_Test),4))
        saga_test_accuracy[i,k] = saga2

        saga3 = float(round(End_time - Start_time,6))
        saga_time[i,k] = saga3
    
        j += 0.05
        k += 1
        
    j = 0
    k = 0

Here is where I ran into trouble with my code, Professor: I could not get the value 1 for l1_ratio. The cause is floating-point accumulation: repeatedly adding 0.05 to j builds up rounding error, so after the twentieth addition j is slightly greater than 1.0 and the while j <= 1.0 condition stops the loop before l1_ratio = 1 is used.
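The drift can be reproduced in isolation. Generating the grid with np.linspace instead computes each point directly from the endpoints, so the final value is exactly 1.0:

```python
import numpy as np

# Repeated addition accumulates rounding error
j = 0.0
for _ in range(20):
    j += 0.05       # each addition rounds, and the error builds up
print(j)            # slightly off from 1.0, so a `j <= 1.0` test can fail

# linspace computes each grid point from the endpoints; the last one is exactly 1.0
ratios = np.linspace(0.0, 1.0, 21)   # 0.00, 0.05, ..., 1.00
print(ratios[-1])
```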

In [225]:
#plot the results with saga

x = saga_ratio[0,]

fig, ax = plt.subplots(ncols=3, figsize=(15,4))

for i in range(3):
    ax[i].plot(x,saga_train_accuracy[i,],label='Training set accuracy', color = 'navy', marker='*',markersize=8)
    ax[i].plot(x,saga_test_accuracy[i,],label='Testing set accuracy', color = 'green', marker='*',markersize=8)
    ax[i].grid()
    ax[i].set_xlabel('l1_ratio')
    ax[i].set_ylabel('Accuracy')
    ax[i].set_title('Saga solver with test size =0.2_random state_%i' % i)
    ax[i].legend()
fig.tight_layout(pad = 0.5)
plt.show()

In these three random states, the testing set accuracy is stable, so it is sensible to pick an l1_ratio that yields a high training set accuracy. Inferred from the three graphs, an l1_ratio between 0.25 and 0.3 works well in all the random states. Therefore, I pick l1_ratio = 0.25.
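Instead of reading l1_ratio off the plots, the same search can be automated with GridSearchCV, which cross-validates each candidate ratio. A sketch on synthetic data; the pipeline and grid below are illustrative, not the notebook's exact setup:

```python
import numpy as np

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(n_samples=200, n_features=14, random_state=0)

pipe = Pipeline([('scale', StandardScaler()),
                 ('clf', LogisticRegression(penalty='elasticnet', solver='saga',
                                            max_iter=5000))])
param_grid = {'clf__l1_ratio': np.linspace(0.0, 1.0, 5)}   # 0.0, 0.25, 0.5, 0.75, 1.0
grid = GridSearchCV(pipe, param_grid, cv=3).fit(X_demo, y_demo)
print(grid.best_params_['clf__l1_ratio'])
```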

In [226]:
# plot the results with saga:

fig, ax = plt.subplots(ncols=3, figsize=(15,4))

for i in range(3):
    ax[i].plot(x,saga_time[i,],label='Running time', color = 'red', marker='*',markersize=8)
    ax[i].grid()
    ax[i].set_xlabel('L1_ratio')
    ax[i].set_ylabel('Running time')
    ax[i].set_title('Saga solver with test size =0.2_random state_%i' % i)
    ax[i].legend()
fig.tight_layout(pad = 0.5)
plt.show()

The running time fluctuates in all the three random states. We can see that with l1_ratio = 0.25, the running time of the algorithm is lowest in two random states (0 and 1).

Now that we have l1_ratio = 0.25, we will find the best testing size to pair with it.

In [227]:
# Approach 3: solver = saga
# penalty = elasticnet
# l1_ratio = 0.25
# find the best testing size

saga_train_accuracy = np.zeros([3,19])
saga_test_accuracy = np.zeros([3,19])
saga_time = np.zeros([3,19])
saga_size = np.zeros([3,19])
j = 0.05
k = 0
for i in range (3):
    while j <1:
        X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=j,random_state=i)
        X_train_scaled = preprocessing.scale(X_train) # scale the dataset
        X_test_scaled = preprocessing.scale(X_test)
        
        Start_time = time.time() #Saving current time
        
        LogisticReg = LogisticRegression(penalty = 'elasticnet',solver = 'saga',l1_ratio = 0.25)
        LogisticReg.fit(X_train_scaled,y_train)

        y_pred_Train = LogisticReg.predict(X_train_scaled)
        y_pred_Test = LogisticReg.predict(X_test_scaled) 
    
        End_time = time.time() #Saving current time
    
        saga_size[i,k] = j
    
        saga1 = float(round(metrics.accuracy_score(y_train,y_pred_Train),4))
        saga_train_accuracy[i,k] = saga1

        saga2 = float(round(metrics.accuracy_score(y_test, y_pred_Test),4))
        saga_test_accuracy[i,k] = saga2

        saga3 = float(round(End_time - Start_time,6))
        saga_time[i,k] = saga3
    
        j+= 0.05
        k+= 1
    
    j = 0.05
    k = 0
In [228]:
#plot the results with saga

x = saga_size[0,]

fig, ax = plt.subplots(ncols=3, figsize=(15,4))

for i in range(3):
    ax[i].plot(x,saga_train_accuracy[i,],label='Training set accuracy', color = 'navy', marker='*',markersize=8)
    ax[i].plot(x,saga_test_accuracy[i,],label='Testing set accuracy', color = 'green', marker='*',markersize=8)
    ax[i].grid()
    ax[i].set_xlabel('Testing size')
    ax[i].set_ylabel('Accuracy')
    ax[i].set_title('Saga solver Model_random state_%i' % i)
    ax[i].legend()
fig.tight_layout(pad = 0.5)
plt.show()

These graphs resemble those for 'lbfgs' and 'sag'.

The training set accuracy is again high in all the random states, often above 85%, and it tends to increase.

The testing set accuracy again fluctuates more in random_state 0 and 1; in these two states, it mostly ranges between 75% and 85%.

In [229]:
# plot the results with saga:

fig, ax = plt.subplots(ncols=3, figsize=(15,4))

for i in range(3):
    ax[i].plot(x,saga_time[i,],label='Running time', color = 'red', marker='*',markersize=8)
    ax[i].grid()
    ax[i].set_xlabel('Testing size')
    ax[i].set_ylabel('Running time')
    ax[i].set_title('Saga solver Model_random state_%i' % i)
    ax[i].legend()
fig.tight_layout(pad = 0.5)
plt.show()

The lowest running time occurs in random state 0; the figures for the other two random states vary.

Overall, the best testing size:

In random_state_0: it is best to choose testing size 20%. The figures for random_state_1 and random_state_2 are 10% and 5% respectively.

In [230]:
# Approach 4: solver = newton-cg

newton_train_accuracy = np.zeros([3,19])
newton_test_accuracy = np.zeros([3,19])
newton_time = np.zeros([3,19])
newton_size = np.zeros([3,19])
j = 0.05
k = 0
for i in range (3):
    while j <1:
        X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=j,random_state=i)
        X_train_scaled = preprocessing.scale(X_train) # scale the dataset
        X_test_scaled = preprocessing.scale(X_test)
        
        Start_time = time.time() #Saving current time
        
        LogisticReg = LogisticRegression(penalty = 'l2',solver = 'newton-cg')
        LogisticReg.fit(X_train_scaled,y_train)

        y_pred_Train = LogisticReg.predict(X_train_scaled)
        y_pred_Test = LogisticReg.predict(X_test_scaled) 
    
        End_time = time.time() #Saving current time
    
        newton_size[i,k] = j
    
        newton1 = float(round(metrics.accuracy_score(y_train,y_pred_Train),4))
        newton_train_accuracy[i,k] = newton1

        newton2 = float(round(metrics.accuracy_score(y_test, y_pred_Test),4))
        newton_test_accuracy[i,k] = newton2

        newton3 = float(round(End_time - Start_time,6))
        newton_time[i,k] = newton3
    
        j+= 0.05
        k+= 1
    
    j = 0.05
    k = 0
In [231]:
#plot the results with newton_cg

x = newton_size[0,]

fig, ax = plt.subplots(ncols=3, figsize=(15,4))

for i in range(3):
    ax[i].plot(x,newton_train_accuracy[i,],label='Training set accuracy', color = 'navy', marker='*',markersize=8)
    ax[i].plot(x,newton_test_accuracy[i,],label='Testing set accuracy', color = 'green', marker='*',markersize=8)
    ax[i].grid()
    ax[i].set_xlabel('Testing size')
    ax[i].set_ylabel('Accuracy')
    ax[i].set_title('Newton_cg solver Model_random state_%i' % i)
    ax[i].legend()
fig.tight_layout(pad = 0.5)
plt.show()

Using the 'newton-cg' solver brings similar results to the three previous options.

The training set accuracy is again high in all the random states, often above 85%, and it tends to increase.

The testing set accuracy again fluctuates more in random_state 0 and 1; in these two states, it mostly ranges between 75% and 85%.

In [232]:
# plot the results with newton_cg:

fig, ax = plt.subplots(ncols=3, figsize=(15,4))

for i in range(3):
    ax[i].plot(x,newton_time[i,],label='Running time', color = 'red', marker='*',markersize=8)
    ax[i].grid()
    ax[i].set_xlabel('Testing size')
    ax[i].set_ylabel('Running time')
    ax[i].set_title('Newton_cg solver Model_random state_%i' % i)
    ax[i].legend()
fig.tight_layout(pad = 0.5)
plt.show()

The lowest running time occurs in random state 0; the figures for the other two random states vary.

Overall, the best testing size for 'newton_cg' solver:

In random_state_0: it is best to choose testing size 20%. The figures for random_state_1 and random_state_2 are 10% and 5% respectively.

Now we come to the 'liblinear' option. The 'liblinear' solver is the only one that does not face the convergence problem, so here we can run the algorithm with 'liblinear' on both the original training dataset and the scaled dataset.

In [256]:
# Approach 5: solver = liblinear
# Using the original training and testing dataset
# for solver = 'liblinear', it can be useful to set fit_intercept = True and increase intercept_scaling
# set intercept_scaling = 10

liblinearor_train_accuracy = np.zeros([3,19])
liblinearor_test_accuracy = np.zeros([3,19])
liblinearor_time = np.zeros([3,19])
liblinearor_size = np.zeros([3,19])
j = 0.05
k = 0

for i in range (3):
    while j <1:
        X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=j,random_state=i)
        
        Start_time = time.time() #Saving current time
        
        LogisticReg = LogisticRegression(penalty = 'l1',solver = 'liblinear',fit_intercept = True,intercept_scaling = 10)
        LogisticReg.fit(X_train,y_train)

        y_pred_Train = LogisticReg.predict(X_train)
        y_pred_Test = LogisticReg.predict(X_test) 
    
        End_time = time.time() #Saving current time
    
        liblinearor_size[i,k] = j
    
        liblinearor1 = float(round(metrics.accuracy_score(y_train,y_pred_Train),4))
        liblinearor_train_accuracy[i,k] = liblinearor1

        liblinearor2 = float(round(metrics.accuracy_score(y_test, y_pred_Test),4))
        liblinearor_test_accuracy[i,k] = liblinearor2

        liblinearor3 = float(round(End_time - Start_time,6))
        liblinearor_time[i,k] = liblinearor3
    
        j+= 0.05
        k+= 1
    
    j = 0.05
    k = 0
In [257]:
#plot the results with liblinear_or

x = liblinearor_size[0,]

fig, ax = plt.subplots(ncols=3, figsize=(15,4))

for i in range(3):
    ax[i].plot(x,liblinearor_train_accuracy[i,],label='Training set accuracy', color = 'navy', marker='*',markersize=8)
    ax[i].plot(x,liblinearor_test_accuracy[i,],label='Testing set accuracy', color = 'green', marker='*',markersize=8)
    ax[i].grid()
    ax[i].set_xlabel('Testing size')
    ax[i].set_ylabel('Accuracy')
    ax[i].set_title('Liblinear_or solver Model_random state_%i' % i) #liblinear_or means the liblinear with the original dataset.
    ax[i].legend()
fig.tight_layout(pad = 0.5)
plt.show()

All the accuracy scores look good (most training set accuracy values are higher than 85%). Among the three random states, random state 1 shows a steady increase in training set accuracy.

In [258]:
# plot the results with liblinear_or:

fig, ax = plt.subplots(ncols=3, figsize=(15,4))

for i in range(3):
    ax[i].plot(x,liblinearor_time[i,],label='Running time', color = 'red', marker='*',markersize=8)
    ax[i].grid()
    ax[i].set_xlabel('Testing size')
    ax[i].set_ylabel('Running time')
    ax[i].set_title('Liblinear_or solver Model_random state_%i' % i)
    ax[i].legend()
fig.tight_layout(pad = 0.5)
plt.show()

The running time also fluctuates when the testing size changes among the three random states.

For the 'liblinear' solver using the original data:

It is best to choose a testing size of about 5% in random state 0. The figures for random state 1 and random state 2 are 10% and 5% respectively.

In [259]:
# Approach 6: solver = liblinear
# Using the scaled dataset

liblinearsc_train_accuracy = np.zeros([3,19])
liblinearsc_test_accuracy = np.zeros([3,19])
liblinearsc_time = np.zeros([3,19])
liblinearsc_size = np.zeros([3,19])

for i in range (3):
    j = 0.05
    k = 0
    while j <1:
        X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=j,random_state=i)
        X_train_scaled = preprocessing.scale(X_train) # scale the dataset
        X_test_scaled = preprocessing.scale(X_test)
        
        Start_time = time.time() #Saving current time
        
        LogisticReg = LogisticRegression(penalty = 'l1',solver = 'liblinear',fit_intercept = True,intercept_scaling = 10)
        LogisticReg.fit(X_train_scaled,y_train)

        y_pred_Train = LogisticReg.predict(X_train_scaled)
        y_pred_Test = LogisticReg.predict(X_test_scaled) 
    
        End_time = time.time() #Saving current time
    
        liblinearsc_size[i,k] = j
    
        liblinearsc1 = float(round(metrics.accuracy_score(y_train,y_pred_Train),4))
        liblinearsc_train_accuracy[i,k] = liblinearsc1

        liblinearsc2 = float(round(metrics.accuracy_score(y_test, y_pred_Test),4))
        liblinearsc_test_accuracy[i,k] = liblinearsc2

        liblinearsc3 = float(round(End_time - Start_time,6))
        liblinearsc_time[i,k] = liblinearsc3
    
        j+= 0.05
        k+= 1
In [260]:
#plot the results with liblinear_sc

x = liblinearsc_size[0,]

fig, ax = plt.subplots(ncols=3, figsize=(15,4))

for i in range(3):
    ax[i].plot(x,liblinearsc_train_accuracy[i,],label='Training set accuracy', color = 'navy', marker='*',markersize=8)
    ax[i].plot(x,liblinearsc_test_accuracy[i,],label='Testing set accuracy', color = 'green', marker='*',markersize=8)
    ax[i].grid()
    ax[i].set_xlabel('Testing size')
    ax[i].set_ylabel('Accuracy')
    ax[i].set_title('Liblinear_sc solver Model_random state_%i' % i) #liblinear_sc means the liblinear with the scaled dataset.
    ax[i].legend()
fig.tight_layout(pad = 0.5)
plt.show()

The accuracy scores also look good. There is a steady increase in the training set accuracy in random states 1 and 2.

In [262]:
# plot the results with liblinear_sc:

fig, ax = plt.subplots(ncols=3, figsize=(15,4))

for i in range(3):
    ax[i].plot(x,liblinearsc_time[i,],label='Running time', color = 'red', marker='*',markersize=8)
    ax[i].grid()
    ax[i].set_xlabel('Testing size')
    ax[i].set_ylabel('Running time')
    ax[i].set_title('Liblinear_sc solver Model_random state_%i' % i)
    ax[i].legend()
fig.tight_layout(pad = 0.5)
plt.show()

The running time also fluctuates considerably when the testing size changes among the three random states.

For the 'liblinear' solver using the scaled data:

It is best to choose testing size 20% in random state 0. The figures for random states 1 and 2 are 10% and 5% respectively.

Now I will create graphs featuring all the solver options with different testing size in 3 random states.

In [272]:
# Create a comprehensive training accuracy plot:

x = liblinearsc_size[0,]

fig, ax = plt.subplots(ncols=3, figsize=(15,4))

for i in range(3):
    ax[i].plot(x,lbfgs_train_accuracy[i,],label='lbfgs', color = 'navy', marker='*',markersize=8)
    ax[i].plot(x,sag_train_accuracy[i,],label='sag', color = 'green', marker='*',markersize=8)
    ax[i].plot(x,saga_train_accuracy[i,],label='saga', color = 'red', marker='*',markersize=8)
    ax[i].plot(x,newton_train_accuracy[i,],label='newton_cg', color = 'magenta', marker='*',markersize=8)
    ax[i].plot(x,liblinearor_train_accuracy[i,],label='liblinear_or', color = 'yellow', marker='*',markersize=8)
    ax[i].plot(x,liblinearsc_train_accuracy[i,],label='liblinear_sc', color = 'cyan', marker='*',markersize=8)
    ax[i].grid()
    ax[i].set_xlabel('Testing size')
    ax[i].set_ylabel('Accuracy')
    ax[i].set_title('Algorithm Training Accuracy_random state_%i' % i)
    ax[i].legend()
fig.tight_layout(pad = 0.5)
plt.show()

It is clear that 'newton_cg' and 'liblinear_sc' stand out in these graphs. However, remember that 'lbfgs', 'sag', 'saga' and 'newton_cg' produce highly similar training set accuracies. Therefore 'newton_cg', being plotted after the three other solvers, is likely to hide their curves.

In [274]:
# Create a comprehensive running time plot:

x = liblinearsc_size[0,]

fig, ax = plt.subplots(ncols=3, figsize=(15,4))

for i in range(3):
    ax[i].plot(x,lbfgs_time[i,],label='lbfgs', color = 'navy', marker='*',markersize=8)
    ax[i].plot(x,sag_time[i,],label='sag', color = 'green', marker='*',markersize=8)
    ax[i].plot(x,saga_time[i,],label='saga', color = 'red', marker='*',markersize=8)
    ax[i].plot(x,newton_time[i,],label='newton_cg', color = 'magenta', marker='*',markersize=8)
    ax[i].plot(x,liblinearor_time[i,],label='liblinear_or', color = 'yellow', marker='*',markersize=8)
    ax[i].plot(x,liblinearsc_time[i,],label='liblinear_sc', color = 'cyan', marker='*',markersize=8)
    ax[i].grid()
    ax[i].set_xlabel('Testing size')
    ax[i].set_ylabel('Running time')
    ax[i].set_title('Algorithm Running time_random state_%i' % i)
    ax[i].legend()
fig.tight_layout(pad = 0.5)
plt.show()

In all three random states, the 'liblinear_sc' option yields the fastest algorithm.

In conclusion, for the heart dataset all the solver options produce good results (high training accuracy and low running time). However, using the 'liblinear' solver with the scaled dataset is the best choice because it is the fastest option for logistic regression. The most suitable testing size for the 'liblinear' solver differs across random states.
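One caveat on the timing comparisons: time.time() has limited resolution and a single measurement is noisy, which likely contributes to the fluctuations seen above. A steadier approach is time.perf_counter() with repeated runs, keeping the minimum; a sketch on synthetic data:

```python
import time

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import scale

X_demo, y_demo = make_classification(n_samples=300, n_features=14, random_state=0)
X_demo_scaled = scale(X_demo)

runs = []
for _ in range(5):
    start = time.perf_counter()          # higher-resolution clock than time.time()
    LogisticRegression(solver='liblinear').fit(X_demo_scaled, y_demo)
    runs.append(time.perf_counter() - start)
print(min(runs))                         # the minimum is least affected by background load
```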

Naive Bayes

In [251]:
from sklearn import naive_bayes #Naive Bayes

from sklearn.metrics import ConfusionMatrixDisplay
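ConfusionMatrixDisplay is imported here; a minimal sketch of how it can be paired with GaussianNB, using synthetic data as a stand-in for the heart dataset:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X_demo, y_demo = make_classification(n_samples=300, n_features=14, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

model = GaussianNB().fit(X_tr, y_tr)
cm = confusion_matrix(y_te, model.predict(X_te))    # rows = true classes, columns = predicted
disp = ConfusionMatrixDisplay(confusion_matrix=cm)  # disp.plot() draws the matrix with matplotlib
print(cm)
```

Calling disp.plot() followed by plt.show() renders the matrix as a heatmap.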
In [252]:
# Approach 1: Naive Bayes Gaussian
# Find the best testing size for this algorithm and run it three times to validate the results consistency.

NB_train_accuracy = np.zeros([3,19])
NB_test_accuracy = np.zeros([3,19])
NB_time = np.zeros([3,19])
NB_size = np.zeros([3,19])

for i in range (3):
    j = 0.05
    k = 0
    while j <1:
        X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=j,random_state=i)
        Start_time = time.time() #Saving current time

        NBayes = naive_bayes.GaussianNB()
        NBayes.fit(X_train,y_train) # time the fit as well, for consistency with the logistic regression cells

        y_pred_Train = NBayes.predict(X_train)
        y_pred_Test = NBayes.predict(X_test)
        
        End_time = time.time() #Saving current time
        
        NB_size[i,k] = j
    
        NB1 = float(round(metrics.accuracy_score(y_train,y_pred_Train),4))
        NB_train_accuracy[i,k] = NB1

        NB2 = float(round(metrics.accuracy_score(y_test, y_pred_Test),4))
        NB_test_accuracy[i,k] = NB2

        NB3 = float(round(End_time - Start_time,6))
        NB_time[i,k] = NB3
    
        j+= 0.05
        k+= 1
In [253]:
#plot the results with Naive Bayes Gaussian

x = NB_size[0,]

fig, ax = plt.subplots(ncols=3, figsize=(15,4))

for i in range(3):
    ax[i].plot(x,NB_train_accuracy[i,],label='Training set accuracy', color = 'navy', marker='*',markersize=8)
    ax[i].plot(x,NB_test_accuracy[i,],label='Testing set accuracy', color = 'green', marker='*',markersize=8)
    ax[i].grid()
    ax[i].set_xlabel('Testing size')
    ax[i].set_ylabel('Accuracy')
    ax[i].set_title('Naive Bayes Gaussian Model_random state_%i' % i) #
    ax[i].legend()
fig.tight_layout(pad = 0.5)
plt.show()

The training accuracy values look good, mostly above 80%. In random state 1, we notice a big gap between training accuracy and testing accuracy.

In [255]:
#plot the results with Naive Bayes Gaussian

x = NB_size[0,]

fig, ax = plt.subplots(ncols=3, figsize=(15,4))

for i in range(3):
    ax[i].plot(x,NB_time[i,],label='Running time', color = 'red', marker='*',markersize=8)
    ax[i].grid()
    ax[i].set_xlabel('Testing size')
    ax[i].set_ylabel('Running time')
    ax[i].set_title('Naive Bayes Gaussian Model_random state_%i' % i) 
    ax[i].legend()
fig.tight_layout(pad = 0.5)
plt.show()

Running the model for the first time (random state 0), the time used tends to be stable, but it is greater than the running time in the remaining states.

For Naive Bayes Gaussian:

In the first run, the best testing size is 20%; in the second and third runs, the best testing size is the same, 10%.

In [278]:
# Approach 2: Naive Bayes Bernoulli
# Use the same method as the one with Naive Bayes Gaussian

NBB_train_accuracy = np.zeros([3,19])
NBB_test_accuracy = np.zeros([3,19])
NBB_time = np.zeros([3,19])
NBB_size = np.zeros([3,19])

for i in range (3):
    j = 0.05
    k = 0
    while j <1:
        X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=j,random_state=i)
        NBayes = naive_bayes.BernoulliNB()
        NBayes.fit(X_train,y_train)
        
        Start_time = time.time() #Saving current time

        y_pred_Train = NBayes.predict(X_train)
        y_pred_Test = NBayes.predict(X_test)
        
        End_time = time.time() #Saving current time
        
        NBB_size[i,k] = j
    
        NBB1 = float(round(metrics.accuracy_score(y_train,y_pred_Train),4))
        NBB_train_accuracy[i,k] = NBB1

        NBB2 = float(round(metrics.accuracy_score(y_test, y_pred_Test),4))
        NBB_test_accuracy[i,k] = NBB2

        NBB3 = float(round(End_time - Start_time,6))
        NBB_time[i,k] = NBB3
    
        j+= 0.05
        k+= 1
In [279]:
#plot the results with Naive Bayes Bernoulli

x = NBB_size[0,]

fig, ax = plt.subplots(ncols=3, figsize=(15,4))

for i in range(3):
    ax[i].plot(x,NBB_train_accuracy[i,],label='Training set accuracy', color = 'navy', marker='*',markersize=8)
    ax[i].plot(x,NBB_test_accuracy[i,],label='Testing set accuracy', color = 'green', marker='*',markersize=8)
    ax[i].grid()
    ax[i].set_xlabel('Testing size')
    ax[i].set_ylabel('Accuracy')
    ax[i].set_title('Naive Bayes Bernoulli Model_random state_%i' % i) 
    ax[i].legend()
fig.tight_layout(pad = 0.5)
plt.show()

Compared to the Gaussian variant, Bernoulli Naive Bayes shows greater fluctuation in the training accuracy values. The lowest recorded value is 77%, in random state 2.

In [280]:
#plot the results with Naive Bayes Bernoulli

x = NBB_size[0,]

fig, ax = plt.subplots(ncols=3, figsize=(15,4))

for i in range(3):
    ax[i].plot(x,NBB_time[i,],label='Running time', color = 'red', marker='*',markersize=8)
    ax[i].grid()
    ax[i].set_xlabel('Testing size')
    ax[i].set_ylabel('Running time')
    ax[i].set_title('Naive Bayes Bernoulli Model_random state_%i' % i) 
    ax[i].legend()
fig.tight_layout(pad = 0.5)
plt.show()

The running time of the Bernoulli Naive Bayes algorithm is small, and it varies with both the testing size and the random state.

In random state 0 the running time gradually decreases, in random state 2 it tends to increase, and in the remaining random state it fluctuates.

For Naive Bayes Bernoulli:

In random state 0, the best testing size for Bernoulli Naive Bayes is 5%, while for random states 1 and 2 it is 10%.

In [281]:
# Approach 3: Naive Bayes Complement
# Use the same method as the one with Naive Bayes Gaussian

NBC_train_accuracy = np.zeros([3,19])
NBC_test_accuracy = np.zeros([3,19])
NBC_time = np.zeros([3,19])
NBC_size = np.zeros([3,19])

for i in range (3):
    j = 0.05
    k = 0
    while j <1:
        X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=j,random_state=i)
        NBayes = naive_bayes.ComplementNB()
        NBayes.fit(X_train,y_train)
        
        Start_time = time.time() #Saving current time

        y_pred_Train = NBayes.predict(X_train)
        y_pred_Test = NBayes.predict(X_test)
        
        End_time = time.time() #Saving current time
        
        NBC_size[i,k] = j
    
        NBC1 = float(round(metrics.accuracy_score(y_train,y_pred_Train),4))
        NBC_train_accuracy[i,k] = NBC1

        NBC2 = float(round(metrics.accuracy_score(y_test, y_pred_Test),4))
        NBC_test_accuracy[i,k] = NBC2

        NBC3 = float(round(End_time - Start_time,6))
        NBC_time[i,k] = NBC3
    
        j+= 0.05
        k+= 1
In [282]:
#plot the results with Naive Bayes Complement

x = NBC_size[0,]

fig, ax = plt.subplots(ncols=3, figsize=(15,4))

for i in range(3):
    ax[i].plot(x,NBC_train_accuracy[i,],label='Training set accuracy', color = 'navy', marker='*',markersize=8)
    ax[i].plot(x,NBC_test_accuracy[i,],label='Testing set accuracy', color = 'green', marker='*',markersize=8)
    ax[i].grid()
    ax[i].set_xlabel('Testing size')
    ax[i].set_ylabel('Accuracy')
    ax[i].set_title('Naive Bayes Complement Model_random state_%i' % i) 
    ax[i].legend()
fig.tight_layout(pad = 0.5)
plt.show()

The training accuracy tends to decrease from state to state, although it remains fairly high (the lowest value is 71%).

In [283]:
#plot the results with Naive Bayes Complement

x = NBC_size[0,]

fig, ax = plt.subplots(ncols=3, figsize=(15,4))

for i in range(3):
    ax[i].plot(x,NBC_time[i,],label='Running time', color = 'red', marker='*',markersize=8)
    ax[i].grid()
    ax[i].set_xlabel('Testing size')
    ax[i].set_ylabel('Running time')
    ax[i].set_title('Naive Bayes Complement Model_random state_%i' % i) 
    ax[i].legend()
fig.tight_layout(pad = 0.5)
plt.show()

The running time in random state 2 is very stable compared to the other random states.

For Naive Bayes Complement:

In random state 0, it is best to pick a testing size of 5%, while for random states 1 and 2 the figures are 10% and 5% respectively.

In [284]:
#plot comprehensive graphs with Naive Bayes Algorithm

x = NBB_size[0,]

fig, ax = plt.subplots(ncols=3, figsize=(15,4))

for i in range(3):
    ax[i].plot(x,NB_train_accuracy[i,],label='Gaussian', color = 'navy', marker='*',markersize=8)
    ax[i].plot(x,NBB_train_accuracy[i,],label='Bernoulli', color = 'red', marker='*',markersize=8)
    ax[i].plot(x,NBC_train_accuracy[i,],label='Complement', color = 'green', marker='*',markersize=8)
    ax[i].grid()
    ax[i].set_xlabel('Testing size')
    ax[i].set_ylabel('Accuracy')
    ax[i].set_title('Naive Bayes Algorithm Model_random state_%i' % i) 
    ax[i].legend()
fig.tight_layout(pad = 0.5)
plt.show()

Gaussian and Bernoulli clearly produce better and more stable results than Complement.

In [285]:
#plot comprehensive graphs with Naive Bayes Algorithm

x = NBB_size[0,]

fig, ax = plt.subplots(ncols=3, figsize=(15,4))

for i in range(3):
    ax[i].plot(x,NB_time[i,],label='Gaussian', color = 'navy', marker='*',markersize=8)
    ax[i].plot(x,NBB_time[i,],label='Bernoulli', color = 'red', marker='*',markersize=8)
    ax[i].plot(x,NBC_time[i,],label='Complement', color = 'green', marker='*',markersize=8)
    ax[i].grid()
    ax[i].set_xlabel('Testing size')
    ax[i].set_ylabel('Running time')
    ax[i].set_title('Naive Bayes Algorithm Model_random state_%i' % i) 
    ax[i].legend()
fig.tight_layout(pad = 0.5)
plt.show()

It is hard to tell from these graphs which variant is the fastest.

All things considered, when we use the Naive Bayes algorithm on the heart dataset, all three variants give good results (high accuracy and short running time). However, Gaussian and Bernoulli should be rated higher than Complement.
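The three-way comparison above can also be produced in a single loop. This is a minimal sketch under the assumption that X and y are available; a synthetic, non-negative stand-in is used here because ComplementNB requires non-negative features:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB, BernoulliNB, ComplementNB
from sklearn import metrics

# Synthetic stand-in, shifted to be non-negative for ComplementNB.
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=0)
X_demo = X_demo - X_demo.min(axis=0)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.1, random_state=0)

results = {}
for name, model in [('Gaussian', GaussianNB()),
                    ('Bernoulli', BernoulliNB()),
                    ('Complement', ComplementNB())]:
    model.fit(X_tr, y_tr)
    results[name] = metrics.accuracy_score(y_te, model.predict(X_te))
print(results)
```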

Classification Tree

In [286]:
from sklearn.tree import DecisionTreeClassifier # Import Decision Tree Classifier
from sklearn.tree import plot_tree
from sklearn.tree import export_graphviz
from sklearn.externals.six import StringIO  # deprecated since sklearn 0.21; io.StringIO is the modern replacement
from IPython.display import Image  
import pydotplus
D:\Anaconda\lib\site-packages\sklearn\externals\six.py:31: FutureWarning: The module is deprecated in version 0.21 and will be removed in version 0.23 since we've dropped support for Python 2.7. Please rely on the official version of six (https://pypi.org/project/six/).
  "(https://pypi.org/project/six/).", FutureWarning)

For the classification tree, we care about two parameters: the criterion ('gini' or 'entropy'; 'gini' by default) and max_depth (an int, default=None: the maximum depth of the tree; if None, nodes are expanded until all leaves are pure or contain fewer than min_samples_split samples).
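Rather than tuning the test size and max_depth one at a time, both parameters can be searched jointly with cross-validation. A minimal sketch on synthetic stand-in data (the parameter grid here is illustrative, not the notebook's actual search):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared heart data.
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=0)

# Search criterion and max_depth jointly, scoring each combination
# with 5-fold cross-validated accuracy.
grid = GridSearchCV(DecisionTreeClassifier(random_state=0),
                    {'criterion': ['gini', 'entropy'],
                     'max_depth': [None, 2, 4, 6, 8]},
                    cv=5)
grid.fit(X_demo, y_demo)
print(grid.best_params_)
```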

In [287]:
# Approach 1: Decision Tree using Gini
# Build a learning curve for the test size
# max_depth = default

DTG_train_accuracy = np.zeros([3,19])
DTG_test_accuracy = np.zeros([3,19])
DTG_time = np.zeros([3,19])
DTG_size = np.zeros([3,19])

for i in range (3):
    j = 0.05
    k = 0
    while j <1:
        X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=j,random_state=i)
        
        Start_time = time.time() #Saving current time
        
        dtree = DecisionTreeClassifier()  #Gini is the default classifier 
        dtree.fit(X_train,y_train)

        y_pred_Train = dtree.predict(X_train) #Predictions
        y_pred_Test = dtree.predict(X_test) #Predictions
        
        End_time = time.time() #Saving current time
        
        DTG_size[i,k] = j
    
        DTG1 = float(round(metrics.accuracy_score(y_train,y_pred_Train),4))
        DTG_train_accuracy[i,k] = DTG1

        DTG2 = float(round(metrics.accuracy_score(y_test, y_pred_Test),4))
        DTG_test_accuracy[i,k] = DTG2

        DTG3 = float(round(End_time - Start_time,6))
        DTG_time[i,k] = DTG3
    
        j+= 0.05
        k+= 1
In [288]:
#plot the results with Decision Tree using Gini

x = DTG_size[0,]

fig, ax = plt.subplots(ncols=3, figsize=(15,4))

for i in range(3):
    ax[i].plot(x,DTG_train_accuracy[i,],label='Training set accuracy', color = 'navy', marker='*',markersize=8)
    ax[i].plot(x,DTG_test_accuracy[i,],label='Testing set accuracy', color = 'green', marker='*',markersize=8)
    ax[i].grid()
    ax[i].set_xlabel('Testing size')
    ax[i].set_ylabel('Accuracy')
    ax[i].set_title('Decision Tree using Gini Model_random state_%i' % i) 
    ax[i].legend()
fig.tight_layout(pad = 0.5)
plt.show()

The training set accuracy is always 100% (the unpruned tree memorizes the training set, a sign of overfitting) while the testing set accuracy varies. In random state 0, a testing size of 25% is the best, while for random states 1 and 2 the figures are 10% and 5% respectively. A testing size of 10% is fairly good in all three random states.
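The flat 100% training accuracy can be reproduced directly: a depth-unrestricted tree keeps splitting until every leaf is pure, so it fits the training set perfectly, while capping max_depth prevents this. A small sketch on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data.
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=0)

# Unrestricted tree: grows until all leaves are pure -> training accuracy 1.0.
full = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)
# Depth-capped tree: forced to stop early, so it cannot memorize.
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_demo, y_demo)

print(full.score(X_demo, y_demo))     # -> 1.0
print(shallow.score(X_demo, y_demo))
```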

In [289]:
#plot the results with Decision Tree using Gini

x = DTG_size[0,]

fig, ax = plt.subplots(ncols=3, figsize=(15,4))

for i in range(3):
    ax[i].plot(x,DTG_time[i,],label='Running time', color = 'red', marker='*',markersize=8)
    ax[i].grid()
    ax[i].set_xlabel('Testing size')
    ax[i].set_ylabel('Running time')
    ax[i].set_title('Decision Tree using Gini Model_random state_%i' % i) 
    ax[i].legend()
fig.tight_layout(pad = 0.5)
plt.show()

The running time recorded in random state 1 is the lowest. Also, running the model with a testing size of 10% will take little time.

In [294]:
# Approach 1: Decision Tree using Gini
# Build a learning curve for the max_depth parameter
# choose test size = 10%

DTGM_train_accuracy = np.zeros([3,27])
DTGM_test_accuracy = np.zeros([3,27])
DTGM_time = np.zeros([3,27])
DTGM_size = np.zeros([3,27])

for i in range (3):
    j = 1
    k = 0
    while j <=27:
        X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.1,random_state=i)
        
        Start_time = time.time() #Saving current time
        
        dtree = DecisionTreeClassifier(max_depth=j)  #Gini is the default classifier 
        dtree.fit(X_train,y_train)

        y_pred_Train = dtree.predict(X_train) #Predictions
        y_pred_Test = dtree.predict(X_test) #Predictions
        
        End_time = time.time() #Saving current time
        
        DTGM_size[i,k] = j
    
        DTGM1 = float(round(metrics.accuracy_score(y_train,y_pred_Train),4))
        DTGM_train_accuracy[i,k] = DTGM1

        DTGM2 = float(round(metrics.accuracy_score(y_test, y_pred_Test),4))
        DTGM_test_accuracy[i,k] = DTGM2

        DTGM3 = float(round(End_time - Start_time,6))
        DTGM_time[i,k] = DTGM3
    
        j+= 1
        k+= 1
In [295]:
#plot the results with Decision Tree using Gini

x = DTGM_size[0,]

fig, ax = plt.subplots(ncols=3, figsize=(15,4))

for i in range(3):
    ax[i].plot(x,DTGM_train_accuracy[i,],label='Training set accuracy', color = 'navy', marker='*',markersize=8)
    ax[i].plot(x,DTGM_test_accuracy[i,],label='Testing set accuracy', color = 'green', marker='*',markersize=8)
    ax[i].grid()
    ax[i].set_xlabel('Max_Depth')
    ax[i].set_ylabel('Accuracy')
    ax[i].set_title('Decision Tree using Gini Model_random state_%i' % i) 
    ax[i].legend()
fig.tight_layout(pad = 0.5)
plt.show()

Inferred from the graphs, the best max_depth in random states 0 and 1 is 4, while for random state 2 it is 6. I will choose max_depth = 4.

This gives us a combination of testing size and max_depth for the Gini decision tree: testing size = 10% and max_depth = 4.

In [297]:
#plot the results with Decision Tree using Gini

x = DTGM_size[0,]

fig, ax = plt.subplots(ncols=3, figsize=(15,4))

for i in range(3):
    ax[i].plot(x,DTGM_time[i,],label='Running time', color = 'red', marker='*',markersize=8)
    ax[i].grid()
    ax[i].set_xlabel('Max_Depth')
    ax[i].set_ylabel('Running time')
    ax[i].set_title('Decision Tree using Gini Model_random state_%i' % i) 
    ax[i].legend()
fig.tight_layout(pad = 0.5)
plt.show()

The running time of the Gini tree varies considerably with max_depth. With max_depth = 4, the running time is relatively low.

In [298]:
# Decision Tree using Gini with max_depth = 4 with different testing size in different times
# These below results will be used for later comparison

DTGN_train_accuracy = np.zeros([3,19])
DTGN_test_accuracy = np.zeros([3,19])
DTGN_time = np.zeros([3,19])
DTGN_size = np.zeros([3,19])

for i in range (3):
    j = 0.05
    k = 0
    while j <1:
        X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=j,random_state=i)
        
        Start_time = time.time() #Saving current time
        
        dtree = DecisionTreeClassifier(max_depth=4)  #Gini is the default classifier 
        dtree.fit(X_train,y_train)

        y_pred_Train = dtree.predict(X_train) #Predictions
        y_pred_Test = dtree.predict(X_test) #Predictions
        
        End_time = time.time() #Saving current time
        
        DTGN_size[i,k] = j
    
        DTGN1 = float(round(metrics.accuracy_score(y_train,y_pred_Train),4))
        DTGN_train_accuracy[i,k] = DTGN1

        DTGN2 = float(round(metrics.accuracy_score(y_test, y_pred_Test),4))
        DTGN_test_accuracy[i,k] = DTGN2

        DTGN3 = float(round(End_time - Start_time,6))
        DTGN_time[i,k] = DTGN3
    
        j+= 0.05
        k+= 1
In [301]:
# Approach 2: Decision Tree using Entropy
# Build a learning curve for the test size
# max_depth = default

DTE_train_accuracy = np.zeros([3,19])
DTE_test_accuracy = np.zeros([3,19])
DTE_time = np.zeros([3,19])
DTE_size = np.zeros([3,19])

for i in range (3):
    j = 0.05
    k = 0
    while j <1:
        X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=j,random_state=i)
        
        Start_time = time.time() #Saving current time
        
        dtree = DecisionTreeClassifier(criterion='entropy')
        dtree.fit(X_train,y_train)

        y_pred_Train = dtree.predict(X_train) #Predictions
        y_pred_Test = dtree.predict(X_test) #Predictions
        
        End_time = time.time() #Saving current time
        
        DTE_size[i,k] = j
    
        DTE1 = float(round(metrics.accuracy_score(y_train,y_pred_Train),4))
        DTE_train_accuracy[i,k] = DTE1

        DTE2 = float(round(metrics.accuracy_score(y_test, y_pred_Test),4))
        DTE_test_accuracy[i,k] = DTE2

        DTE3 = float(round(End_time - Start_time,6))
        DTE_time[i,k] = DTE3
    
        j+= 0.05
        k+= 1
In [302]:
#plot the results with Decision Tree using Entropy

x = DTE_size[0,]

fig, ax = plt.subplots(ncols=3, figsize=(15,4))

for i in range(3):
    ax[i].plot(x,DTE_train_accuracy[i,],label='Training set accuracy', color = 'navy', marker='*',markersize=8)
    ax[i].plot(x,DTE_test_accuracy[i,],label='Testing set accuracy', color = 'green', marker='*',markersize=8)
    ax[i].grid()
    ax[i].set_xlabel('Testing size')
    ax[i].set_ylabel('Accuracy')
    ax[i].set_title('Decision Tree using Entropy Model_random state_%i' % i) 
    ax[i].legend()
fig.tight_layout(pad = 0.5)
plt.show()

The best testing size for the three random states is 5%, 10% and 5% respectively. It is sensible to choose a shared value of 5% for all the random states.

In [303]:
# Approach 2: Decision Tree using Entropy
# Build a learning curve for the max_depth parameter
# choose test size = 5%

DTEM_train_accuracy = np.zeros([3,27])
DTEM_test_accuracy = np.zeros([3,27])
DTEM_time = np.zeros([3,27])
DTEM_size = np.zeros([3,27])

for i in range (3):
    j = 1
    k = 0
    while j <=27:
        X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.05,random_state=i)
        
        Start_time = time.time() #Saving current time
        
        dtree = DecisionTreeClassifier(criterion='entropy',max_depth=j)
        dtree.fit(X_train,y_train)

        y_pred_Train = dtree.predict(X_train) #Predictions
        y_pred_Test = dtree.predict(X_test) #Predictions
        
        End_time = time.time() #Saving current time
        
        DTEM_size[i,k] = j
    
        DTEM1 = float(round(metrics.accuracy_score(y_train,y_pred_Train),4))
        DTEM_train_accuracy[i,k] = DTEM1

        DTEM2 = float(round(metrics.accuracy_score(y_test, y_pred_Test),4))
        DTEM_test_accuracy[i,k] = DTEM2

        DTEM3 = float(round(End_time - Start_time,6))
        DTEM_time[i,k] = DTEM3
    
        j+= 1
        k+= 1
In [304]:
#plot the results with Decision Tree using Entropy

x = DTEM_size[0,]

fig, ax = plt.subplots(ncols=3, figsize=(15,4))

for i in range(3):
    ax[i].plot(x,DTEM_train_accuracy[i,],label='Training set accuracy', color = 'navy', marker='*',markersize=8)
    ax[i].plot(x,DTEM_test_accuracy[i,],label='Testing set accuracy', color = 'green', marker='*',markersize=8)
    ax[i].grid()
    ax[i].set_xlabel('Testing size')
    ax[i].set_ylabel('Accuracy')
    ax[i].set_title('Decision Tree using Entropy Model_random state_%i' % i) 
    ax[i].legend()
fig.tight_layout(pad = 0.5)
plt.show()

The suitable max_depth for the three random states is 4, 6 and 12 respectively. A max_depth of 8 is a fair compromise across the random states.

In [306]:
# Decision Tree using Entropy with max_depth = 8 with different testing size in different times
# These below results will be used for later comparison

DTEY_train_accuracy = np.zeros([3,19])
DTEY_test_accuracy = np.zeros([3,19])
DTEY_time = np.zeros([3,19])
DTEY_size = np.zeros([3,19])

for i in range (3):
    j = 0.05
    k = 0
    while j <1:
        X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=j,random_state=i)
        
        Start_time = time.time() #Saving current time
        
        dtree = DecisionTreeClassifier(criterion = 'entropy',max_depth=8)
        dtree.fit(X_train,y_train)

        y_pred_Train = dtree.predict(X_train) #Predictions
        y_pred_Test = dtree.predict(X_test) #Predictions
        
        End_time = time.time() #Saving current time
        
        DTEY_size[i,k] = j
    
        DTEY1 = float(round(metrics.accuracy_score(y_train,y_pred_Train),4))
        DTEY_train_accuracy[i,k] = DTEY1

        DTEY2 = float(round(metrics.accuracy_score(y_test, y_pred_Test),4))
        DTEY_test_accuracy[i,k] = DTEY2

        DTEY3 = float(round(End_time - Start_time,6))
        DTEY_time[i,k] = DTEY3
    
        j+= 0.05
        k+= 1

Now we will compare Gini and Entropy at their best parameter combinations.

In [307]:
# plot the comparison

x = DTEY_size[0,]

fig, ax = plt.subplots(ncols=3, figsize=(15,4))

for i in range(3):
    ax[i].plot(x,DTGN_train_accuracy[i,],label='Gini Training accuracy', color = 'navy', marker='*',markersize=8)
    ax[i].plot(x,DTGN_test_accuracy[i,],label='Gini Testing accuracy', color = 'blue', marker='*',markersize=8)
    ax[i].plot(x,DTEY_train_accuracy[i,],label='Entropy Training accuracy', color = 'yellow', marker='*',markersize=8)
    ax[i].plot(x,DTEY_test_accuracy[i,],label='Entropy Testing accuracy', color = 'green', marker='*',markersize=8)
    ax[i].grid()
    ax[i].set_xlabel('Testing size')
    ax[i].set_ylabel('Accuracy')
    ax[i].set_title('Decision Tree using Gini Entropy_random state_%i' % i) 
    ax[i].legend()
fig.tight_layout(pad = 0.5)
plt.show()

Entropy clearly brings better results than Gini. Moreover, at their preferred testing sizes (10% for Gini and 5% for Entropy), both models predict well.

In [309]:
# plot the comparison

x = DTEY_size[0,]

fig, ax = plt.subplots(ncols=3, figsize=(15,4))

for i in range(3):
    ax[i].plot(x,DTGN_time[i,],label='Gini running time', color = 'red', marker='*',markersize=8)
    ax[i].plot(x,DTEY_time[i,],label='Entropy running time', color = 'magenta', marker='*',markersize=8)
    ax[i].grid()
    ax[i].set_xlabel('Testing size')
    ax[i].set_ylabel('Running time')
    ax[i].set_title('Decision Tree using Gini Entropy_random state_%i' % i) 
    ax[i].legend()
fig.tight_layout(pad = 0.5)
plt.show()

The running time of the Entropy model is mostly greater than that of the Gini model.

In conclusion, if we use a classification tree for the heart dataset, it is better to adopt the Entropy criterion with a testing size of 5% and max_depth = 8.

In [311]:
# plot the tree with the findings:

import os

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.05)

dtree = DecisionTreeClassifier(criterion='entropy', max_depth=8)
dtree.fit(X_train,y_train)

y_pred_Train = dtree.predict(X_train) #Predictions
y_pred_Test = dtree.predict(X_test) #Predictions

dot_data = StringIO()

os.environ['PATH'] = os.environ['PATH']+';'+os.environ['CONDA_PREFIX']+r"\Library\bin\graphviz"

export_graphviz(dtree, out_file=dot_data,  
                filled=True, rounded=True,
                special_characters=True,feature_names = feature_cols,class_names=['0','1'])

graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  

graph.write_png('heart_prunned.png')

Image(graph.create_png())
Out[311]:

Random Forest

In terms of the random forest, there are many parameters to consider, such as:

n_estimators: int,default=100 (the number of trees in the forest)

criterion: {“gini”, “entropy”} default=”gini” (the function to measure the quality of a split)

max_depth: int, default=None (the maximum depth of the tree)

max_features: {“auto”, “sqrt”, “log2”}, int or float, default=”auto” (the number of features to consider when looking for the best split). For classification, “auto” is equivalent to “sqrt”.

In this project we will focus on the max_features parameter and find the best testing size to combine with each max_features option. We will also compare the results obtained with the Gini and Entropy criteria across runs. Other parameters will be kept at their default values.
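One point worth checking before the experiments: assuming the 13 predictive inputs of this dataset, "sqrt" and "log2" select the same number of features per split (scikit-learn takes the integer part of the computed value), so the two options may behave very similarly here. A quick check:

```python
import math

n_features = 13  # heart dataset inputs (15 attributes minus patientid and target)

# Number of features considered at each split, as sklearn computes it
# (integer part, with a minimum of 1).
per_split = {
    'sqrt': max(1, int(math.sqrt(n_features))),  # -> 3
    'log2': max(1, int(math.log2(n_features))),  # -> 3
}
print(per_split)
```

With both options landing on 3 features per split, differences between "sqrt" and "log2" runs are likely dominated by randomness rather than the parameter itself.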

In [312]:
from sklearn.ensemble import RandomForestClassifier
In [315]:
# Approach 1: Random Forest using Gini and "sqrt" max_features
# build a learning curve for the testing size

RFGS_train_accuracy = np.zeros([3,19])
RFGS_test_accuracy = np.zeros([3,19])
RFGS_time = np.zeros([3,19])
RFGS_size = np.zeros([3,19])

for i in range (3):
    j = 0.05
    k = 0
    while j <1:
        X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=j,random_state=i)
        
        Start_time = time.time() #Saving current time
        
        dForest = RandomForestClassifier(criterion = 'gini',max_features='sqrt')
        dForest.fit(X_train,y_train)

        y_pred_Train = dForest.predict(X_train) #Predictions
        y_pred_Test = dForest.predict(X_test) #Predictions
        
        End_time = time.time() #Saving current time
        
        RFGS_size[i,k] = j
    
        RFGS1 = float(round(metrics.accuracy_score(y_train,y_pred_Train),4))
        RFGS_train_accuracy[i,k] = RFGS1

        RFGS2 = float(round(metrics.accuracy_score(y_test, y_pred_Test),4))
        RFGS_test_accuracy[i,k] = RFGS2

        RFGS3 = float(round(End_time - Start_time,6))
        RFGS_time[i,k] = RFGS3
    
        j+= 0.05
        k+= 1
In [316]:
# plot the obtained results:

x = RFGS_size[0,]

fig, ax = plt.subplots(ncols=3, figsize=(15,4))

for i in range(3):
    ax[i].plot(x,RFGS_train_accuracy[i,],label='Gini Sqrt Train accuracy', color = 'navy', marker='*',markersize=8)
    ax[i].plot(x,RFGS_test_accuracy[i,],label='Gini Sqrt Test accuracy', color = 'green', marker='*',markersize=8)
    ax[i].grid()
    ax[i].set_xlabel('Testing size')
    ax[i].set_ylabel('Accuracy')
    ax[i].set_title('Random Forest using Gini and Sqrt_random state_%i' % i) 
    ax[i].legend()
fig.tight_layout(pad = 0.5)
plt.show()

The training accuracy is always 100% while the testing accuracy tends to decrease. The best testing size for each random state is 20%, 5% and 5% respectively. A testing size of 10% works well in all the random states.

In [317]:
# Approach 2: Random Forest using Gini and "log2" max_features
# build a learning curve for the testing size

RFGL_train_accuracy = np.zeros([3,19])
RFGL_test_accuracy = np.zeros([3,19])
RFGL_time = np.zeros([3,19])
RFGL_size = np.zeros([3,19])

for i in range (3):
    j = 0.05
    k = 0
    while j <1:
        X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=j,random_state=i)
        
        Start_time = time.time() #Saving current time
        
        dForest = RandomForestClassifier(criterion = 'gini',max_features='log2')
        dForest.fit(X_train,y_train)

        y_pred_Train = dForest.predict(X_train) #Predictions
        y_pred_Test = dForest.predict(X_test) #Predictions
        
        End_time = time.time() #Saving current time
        
        RFGL_size[i,k] = j
    
        RFGL1 = float(round(metrics.accuracy_score(y_train,y_pred_Train),4))
        RFGL_train_accuracy[i,k] = RFGL1

        RFGL2 = float(round(metrics.accuracy_score(y_test, y_pred_Test),4))
        RFGL_test_accuracy[i,k] = RFGL2

        RFGL3 = float(round(End_time - Start_time,6))
        RFGL_time[i,k] = RFGL3
    
        j+= 0.05
        k+= 1
In [318]:
# plot the obtained results:

x = RFGL_size[0,]

fig, ax = plt.subplots(ncols=3, figsize=(15,4))

for i in range(3):
    ax[i].plot(x,RFGL_train_accuracy[i,],label='Gini Log2 Train accuracy', color = 'navy', marker='*',markersize=8)
    ax[i].plot(x,RFGL_test_accuracy[i,],label='Gini Log2 Test accuracy', color = 'green', marker='*',markersize=8)
    ax[i].grid()
    ax[i].set_xlabel('Testing size')
    ax[i].set_ylabel('Accuracy')
    ax[i].set_title('Random Forest using Gini and Log2_random state_%i' % i) 
    ax[i].legend()
fig.tight_layout(pad = 0.5)
plt.show()

The best testing size for each random state is 20%, 5% and 5% respectively. A testing size of 10% works well in all the random states.

In [319]:
# Approach 3: Random Forest using Entropy and "sqrt" max_features
# build a learning curve for the testing size

RFES_train_accuracy = np.zeros([3,19])
RFES_test_accuracy = np.zeros([3,19])
RFES_time = np.zeros([3,19])
RFES_size = np.zeros([3,19])

for i in range (3):
    j = 0.05
    k = 0
    while j <1:
        X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=j,random_state=i)
        
        Start_time = time.time() #Saving current time
        
        dForest = RandomForestClassifier(criterion = 'entropy',max_features='sqrt')
        dForest.fit(X_train,y_train)

        y_pred_Train = dForest.predict(X_train) #Predictions
        y_pred_Test = dForest.predict(X_test) #Predictions
        
        End_time = time.time() #Saving current time
        
        RFES_size[i,k] = j
    
        RFES1 = float(round(metrics.accuracy_score(y_train,y_pred_Train),4))
        RFES_train_accuracy[i,k] = RFES1

        RFES2 = float(round(metrics.accuracy_score(y_test, y_pred_Test),4))
        RFES_test_accuracy[i,k] = RFES2

        RFES3 = float(round(End_time - Start_time,6))
        RFES_time[i,k] = RFES3
    
        j+= 0.05
        k+= 1
In [320]:
# plot the obtained results:

x = RFES_size[0,]

fig, ax = plt.subplots(ncols=3, figsize=(15,4))

for i in range(3):
    ax[i].plot(x,RFES_train_accuracy[i,],label='Entropy sqrt Train accuracy', color = 'navy', marker='*',markersize=8)
    ax[i].plot(x,RFES_test_accuracy[i,],label='Entropy sqrt Test accuracy', color = 'green', marker='*',markersize=8)
    ax[i].grid()
    ax[i].set_xlabel('Testing size')
    ax[i].set_ylabel('Accuracy')
    ax[i].set_title('Random Forest using Entropy and sqrt_random state_%i' % i) 
    ax[i].legend()
fig.tight_layout(pad = 0.5)
plt.show()

The best testing size for each random state is 15%, 10% and 5% respectively. A testing size of 10% works well in all the random states.

In [321]:
# Approach 3: Random Forest using Entropy and "log2" max_features
# build a learning curve for the testing size

RFEL_train_accuracy = np.zeros([3,19])
RFEL_test_accuracy = np.zeros([3,19])
RFEL_time = np.zeros([3,19])
RFEL_size = np.zeros([3,19])

for i in range (3):
    j = 0.05
    k = 0
    while j <1:
        X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=j,random_state=i)
        
        Start_time = time.time() #Saving current time
        
        dForest = RandomForestClassifier(criterion = 'entropy',max_features='log2')
        dForest.fit(X_train,y_train)

        y_pred_Train = dForest.predict(X_train) #Predictions
        y_pred_Test = dForest.predict(X_test) #Predictions
        
        End_time = time.time() #Saving current time
        
        RFEL_size[i,k] = j
    
        RFEL1 = float(round(metrics.accuracy_score(y_train,y_pred_Train),4))
        RFEL_train_accuracy[i,k] = RFEL1

        RFEL2 = float(round(metrics.accuracy_score(y_test, y_pred_Test),4))
        RFEL_test_accuracy[i,k] = RFEL2

        RFEL3 = float(round(End_time - Start_time,6))
        RFEL_time[i,k] = RFEL3
    
        j+= 0.05
        k+= 1
In [322]:
# plot the obtained results:

x = RFEL_size[0,]

fig, ax = plt.subplots(ncols=3, figsize=(15,4))

for i in range(3):
    ax[i].plot(x,RFEL_train_accuracy[i,],label='Entropy log2 Train accuracy', color = 'navy', marker='*',markersize=8)
    ax[i].plot(x,RFEL_test_accuracy[i,],label='Entropy log2 Test accuracy', color = 'green', marker='*',markersize=8)
    ax[i].grid()
    ax[i].set_xlabel('Testing size')
    ax[i].set_ylabel('Accuracy')
    ax[i].set_title('Random Forest using Entropy and log2_random state_%i' % i) 
    ax[i].legend()
fig.tight_layout(pad = 0.5)
plt.show()

The best testing size for each random state is 20%, 10% and 5% respectively. A testing size of 18% works well in all the random states.

In [323]:
# compare all the obtained results

# plot the comparison

x = RFEL_size[0,]

fig, ax = plt.subplots(ncols=3, figsize=(15,4))

for i in range(3):
    ax[i].plot(x,RFGS_test_accuracy[i,],label='Gini sqrt test accuracy', color = 'navy', marker='*',markersize=8)
    ax[i].plot(x,RFGL_test_accuracy[i,],label='Gini log2 Test accuracy', color = 'red', marker='*',markersize=8)
    ax[i].plot(x,RFES_test_accuracy[i,],label='Entropy sqrt test accuracy', color = 'yellow', marker='*',markersize=8)
    ax[i].plot(x,RFEL_test_accuracy[i,],label='Entropy log2 Test accuracy', color = 'green', marker='*',markersize=8)
    ax[i].grid()
    ax[i].set_xlabel('Testing size')
    ax[i].set_ylabel('Accuracy')
    ax[i].set_title('Random Forest using Gini and Entropy_random state_%i' % i) 
    ax[i].legend()
fig.tight_layout(pad = 0.5)
plt.show()

Overall, it is not easy to tell from these graphs which configuration gives the best results.

In [324]:
# compare all the obtained results

# plot the comparison

x = RFEL_size[0,]

fig, ax = plt.subplots(ncols=3, figsize=(15,4))

for i in range(3):
    ax[i].plot(x,RFGS_time[i,],label='Gini sqrt running time', color = 'navy', marker='*',markersize=8)
    ax[i].plot(x,RFGL_time[i,],label='Gini log2 running time', color = 'red', marker='*',markersize=8)
    ax[i].plot(x,RFES_time[i,],label='Entropy sqrt running time', color = 'yellow', marker='*',markersize=8)
    ax[i].plot(x,RFEL_time[i,],label='Entropy log2 running time', color = 'green', marker='*',markersize=8)
    ax[i].grid()
    ax[i].set_xlabel('Testing size')
    ax[i].set_ylabel('Running time')
    ax[i].set_title('Random Forest using Gini and Entropy_random state_%i' % i) 
    ax[i].legend()
fig.tight_layout(pad = 0.5)
plt.show()

Also, the running times of these four methods are hard to distinguish.

Overall, when tuning only the criterion and max_features parameters of the random forest on the heart dataset, the results are similar and none stands out, which suggests that at least one important parameter has not been considered here. Even so, all the obtained results are still good.
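One parameter the sweeps above leave fixed is n_estimators (the number of trees), a likely candidate for the missing parameter. A minimal sketch of sweeping it, using a synthetic stand-in for X and y since the heart dataset is not reloaded here:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics

# synthetic stand-in for the heart dataset (303 datapoints, 13 input features)
X, y = make_classification(n_samples=303, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.18, random_state=0)

for n in [10, 50, 100, 200]:
    forest = RandomForestClassifier(n_estimators=n, criterion='entropy',
                                    max_features='log2', random_state=0)
    forest.fit(X_train, y_train)
    acc = float(round(metrics.accuracy_score(y_test, forest.predict(X_test)), 4))
    print(n, acc)
```

The same loop structure as the earlier cells applies; only the swept parameter changes.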

In [326]:
# plot the tree with one option: Entropy and log2

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.18)

dForest = RandomForestClassifier(criterion = 'entropy',max_features='log2')
dForest.fit(X_train,y_train)

y_pred_Train = dForest.predict(X_train) #Predictions
y_pred_Test = dForest.predict(X_test) #Predictions

dot_data = StringIO()

os.environ['PATH'] = os.environ['PATH']+';'+os.environ['CONDA_PREFIX']+r"\Library\bin\graphviz" # make the conda Graphviz binaries visible on Windows

# a forest has no single tree to plot; export its first estimator (dtree was a different model)
export_graphviz(dForest.estimators_[0], out_file=dot_data,  
                filled=True, rounded=True,
                special_characters=True,feature_names = feature_cols,class_names=['0','1'])

graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  

graph.write_png('heart_prunned.png')

Image(graph.create_png())
Out[326]:

K Nearest Neighbors

In K Nearest Neighbors (KNN), one important parameter needs attention: n_neighbors.

Similar to the work with the other algorithms, in this part I will again try to find the best combination of testing size and parameter by running the algorithm several times.

First, I will try to obtain the best n_neighbors. Its value can range from 1 to the number of datapoints in the dataset, which contains 303 datapoints in total. I will loop n_neighbors from 1 to 50. For the test size, I start with three different values: 10%, 20% and 40%.

In [329]:
from sklearn.neighbors import KNeighborsClassifier
In [333]:
# Test size = 10%
# Run the model three times, three random states

KNN_train_accuracy = np.zeros([3,50])
KNN_test_accuracy = np.zeros([3,50])
KNN_time = np.zeros([3,50])
KNN_number = np.zeros([3,50])

for i in range (3):
    j = 1
    k = 0
    while j <=50:
        X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.1,random_state=i)
        
        Start_time = time.time() #Saving current time
        
        KNN = KNeighborsClassifier(n_neighbors=j)
        KNN.fit(X_train,y_train)

        y_pred_Train = KNN.predict(X_train) #Predictions
        y_pred_Test = KNN.predict(X_test) #Predictions
        
        End_time = time.time() #Saving current time
        
        KNN_number[i,k] = j
    
        KNN1 = float(round(metrics.accuracy_score(y_train,y_pred_Train),4))
        KNN_train_accuracy[i,k] = KNN1

        KNN2 = float(round(metrics.accuracy_score(y_test, y_pred_Test),4))
        KNN_test_accuracy[i,k] = KNN2

        KNN3 = float(round(End_time - Start_time,6))
        KNN_time[i,k] = KNN3
    
        j+= 1
        k+= 1
In [334]:
# plot the obtained results:

x = KNN_number[0,]

fig, ax = plt.subplots(ncols=3, figsize=(15,4))

for i in range(3):
    ax[i].plot(x,KNN_train_accuracy[i,],label='Train accuracy', color = 'navy', marker='*',markersize=8)
    ax[i].plot(x,KNN_test_accuracy[i,],label='Test accuracy', color = 'green', marker='*',markersize=8)
    ax[i].grid()
    ax[i].set_xlabel('Number of Neighbors')
    ax[i].set_ylabel('Accuracy')
    ax[i].set_title('KNN_random state_%i' % i) 
    ax[i].legend()
fig.tight_layout(pad = 0.5)
plt.show()

Based on the graphs, a number of neighbors around 25 works very well when the testing size is 10%.

In [335]:
# Test size = 20%
# Run the model three times, three random states

KNNa_train_accuracy = np.zeros([3,50])
KNNa_test_accuracy = np.zeros([3,50])
KNNa_time = np.zeros([3,50])
KNNa_number = np.zeros([3,50])

for i in range (3):
    j = 1
    k = 0
    while j <=50:
        X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=i)
        
        Start_time = time.time() #Saving current time
        
        KNN = KNeighborsClassifier(n_neighbors=j)
        KNN.fit(X_train,y_train)

        y_pred_Train = KNN.predict(X_train) #Predictions
        y_pred_Test = KNN.predict(X_test) #Predictions
        
        End_time = time.time() #Saving current time
        
        KNNa_number[i,k] = j
    
        KNNa1 = float(round(metrics.accuracy_score(y_train,y_pred_Train),4))
        KNNa_train_accuracy[i,k] = KNNa1

        KNNa2 = float(round(metrics.accuracy_score(y_test, y_pred_Test),4))
        KNNa_test_accuracy[i,k] = KNNa2

        KNNa3 = float(round(End_time - Start_time,6))
        KNNa_time[i,k] = KNNa3
    
        j+= 1
        k+= 1
In [336]:
# plot the obtained results:

x = KNNa_number[0,]

fig, ax = plt.subplots(ncols=3, figsize=(15,4))

for i in range(3):
    ax[i].plot(x,KNNa_train_accuracy[i,],label='Train accuracy', color = 'navy', marker='*',markersize=8)
    ax[i].plot(x,KNNa_test_accuracy[i,],label='Test accuracy', color = 'green', marker='*',markersize=8)
    ax[i].grid()
    ax[i].set_xlabel('Number of Neighbors')
    ax[i].set_ylabel('Accuracy')
    ax[i].set_title('KNN_random state_%i' % i) 
    ax[i].legend()
fig.tight_layout(pad = 0.5)
plt.show()

In random state 0, n_neighbors around 40 is good, while in random states 1 and 2, n_neighbors between 20 and 25 is good.

In [337]:
# Test size = 40%
# Run the model three times, three random states

KNNb_train_accuracy = np.zeros([3,50])
KNNb_test_accuracy = np.zeros([3,50])
KNNb_time = np.zeros([3,50])
KNNb_number = np.zeros([3,50])

for i in range (3):
    j = 1
    k = 0
    while j <=50:
        X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.4,random_state=i)
        
        Start_time = time.time() #Saving current time
        
        KNN = KNeighborsClassifier(n_neighbors=j)
        KNN.fit(X_train,y_train)

        y_pred_Train = KNN.predict(X_train) #Predictions
        y_pred_Test = KNN.predict(X_test) #Predictions
        
        End_time = time.time() #Saving current time
        
        KNNb_number[i,k] = j
    
        KNNb1 = float(round(metrics.accuracy_score(y_train,y_pred_Train),4))
        KNNb_train_accuracy[i,k] = KNNb1

        KNNb2 = float(round(metrics.accuracy_score(y_test, y_pred_Test),4))
        KNNb_test_accuracy[i,k] = KNNb2

        KNNb3 = float(round(End_time - Start_time,6))
        KNNb_time[i,k] = KNNb3
    
        j+= 1
        k+= 1
In [338]:
# plot the obtained results:

x = KNNb_number[0,]

fig, ax = plt.subplots(ncols=3, figsize=(15,4))

for i in range(3):
    ax[i].plot(x,KNNb_train_accuracy[i,],label='Train accuracy', color = 'navy', marker='*',markersize=8)
    ax[i].plot(x,KNNb_test_accuracy[i,],label='Test accuracy', color = 'green', marker='*',markersize=8)
    ax[i].grid()
    ax[i].set_xlabel('Number of Neighbors')
    ax[i].set_ylabel('Accuracy')
    ax[i].set_title('KNN_random state_%i' % i) 
    ax[i].legend()
fig.tight_layout(pad = 0.5)
plt.show()

In random state 0, n_neighbors = 25 is good, while the best values for random states 1 and 2 are 20 and around 3 respectively.

In [339]:
# compare the obtained results for the three different testing sizes
# plot the testing accuracy

x = KNNb_number[0,]

fig, ax = plt.subplots(ncols=3, figsize=(15,4))

for i in range(3):
    ax[i].plot(x,KNN_test_accuracy[i,],label='testing size = 0.1', color = 'blue', marker='*',markersize=8)
    ax[i].plot(x,KNNa_test_accuracy[i,],label='testing size = 0.2', color = 'red', marker='*',markersize=8)
    ax[i].plot(x,KNNb_test_accuracy[i,],label='testing size = 0.4', color = 'yellow', marker='*',markersize=8)
    ax[i].grid()
    ax[i].set_xlabel('Number of Neighbors')
    ax[i].set_ylabel('Accuracy')
    ax[i].set_title('KNN_random state_%i' % i) 
    ax[i].legend()
fig.tight_layout(pad = 0.5)
plt.show()

Based on the graphs, a number of neighbors between 20 and 25 is good enough for all the random states. Pick n_neighbors = 23.

In [386]:
# Build a learning curve for the testing size when n_neighbors = 23.
# n_neighbors = 23 => the training set must contain at least 23 datapoints,
# so the training size must be >= 23/303 ≈ 0.076 and the test size <= 0.924

KNNc_train_accuracy = np.zeros([3,17])
KNNc_test_accuracy = np.zeros([3,17])
KNNc_time = np.zeros([3,17])
KNNc_size = np.zeros([3,17])

for i in range (3):
    j = 0.05
    k = 0
    while j <=0.9:
        X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=j,random_state=i)
        
        Start_time = time.time() #Saving current time
        
        KNN = KNeighborsClassifier(n_neighbors=23)
        KNN.fit(X_train,y_train)

        y_pred_Train = KNN.predict(X_train) #Predictions
        y_pred_Test = KNN.predict(X_test) #Predictions
        
        End_time = time.time() #Saving current time
        
        KNNc_size[i,k] = j
    
        KNNc1 = float(round(metrics.accuracy_score(y_train,y_pred_Train),4))
        KNNc_train_accuracy[i,k] = KNNc1

        KNNc2 = float(round(metrics.accuracy_score(y_test, y_pred_Test),4))
        KNNc_test_accuracy[i,k] = KNNc2

        KNNc3 = float(round(End_time - Start_time,6))
        KNNc_time[i,k] = KNNc3
    
        j+= 0.05
        k+= 1
In [345]:
# plot the learning curve for n_neighbors = 23
# training and testing accuracy

x = KNNc_size[0,]

fig, ax = plt.subplots(ncols=3, figsize=(15,4))

for i in range(3):
    ax[i].plot(x,KNNc_train_accuracy[i,],label='training accuracy', color = 'blue', marker='*',markersize=8)
    ax[i].plot(x,KNNc_test_accuracy[i,],label='testing accuracy', color = 'red', marker='*',markersize=8)
    ax[i].grid()
    ax[i].set_xlabel('testing size')
    ax[i].set_ylabel('Accuracy')
    ax[i].set_title('KNN with n_neighbors = 23_random state_%i' % i) 
    ax[i].legend()
fig.tight_layout(pad = 0.5)
plt.show()

Based on the graphs, a testing size of 0.1 is good enough for prediction in all three random states.

In [349]:
# plot the running time for n_neighbors = 23

x = KNNc_size[0,]

fig, ax = plt.subplots(ncols=3, figsize=(15,4))

for i in range(3):
    ax[i].plot(x,KNNc_time[i,],label='running time', color = 'magenta', marker='*',markersize=8)
    ax[i].grid()
    ax[i].set_xlabel('testing size')
    ax[i].set_ylabel('Running time')
    ax[i].set_title('KNN with n_neighbors = 23_random state_%i' % i) 
    ax[i].legend()
fig.tight_layout(pad = 0.5)
plt.show()

Random state 1 shows the lowest KNN running times. A testing size of 0.1 makes the algorithm take more time to run than the other testing sizes.

In conclusion, we settle on the combination of testing size = 0.1 (10%) and n_neighbors = 23.
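The chosen combination can be written as a single final model. A minimal sketch, using a synthetic stand-in for X and y since the heart dataset is not reloaded here:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

# synthetic stand-in for the heart dataset (303 datapoints, 13 input features)
X, y = make_classification(n_samples=303, n_features=13, random_state=0)

# the chosen combination: testing size = 0.1 and n_neighbors = 23
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=0)
KNN = KNeighborsClassifier(n_neighbors=23)
KNN.fit(X_train, y_train)
test_acc = metrics.accuracy_score(y_test, KNN.predict(X_test))
print(round(test_acc, 4))
```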

Extra Trees Classifier

In [350]:
from sklearn.ensemble import ExtraTreesClassifier
In [351]:
# Build a learning curve for the testing size.

ET_train_accuracy = np.zeros([3,19])
ET_test_accuracy = np.zeros([3,19])
ET_time = np.zeros([3,19])
ET_size = np.zeros([3,19])

for i in range (3):
    j = 0.05
    k = 0
    while j <=1:
        X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=j,random_state=i)
        
        Start_time = time.time() #Saving current time
        
        ETrees = ExtraTreesClassifier()
        ETrees.fit(X_train,y_train)

        y_pred_Train = ETrees.predict(X_train) #Predictions
        y_pred_Test = ETrees.predict(X_test) #Predictions
        
        End_time = time.time() #Saving current time
        
        ET_size[i,k] = j
    
        ET1 = float(round(metrics.accuracy_score(y_train,y_pred_Train),4))
        ET_train_accuracy[i,k] = ET1

        ET2 = float(round(metrics.accuracy_score(y_test, y_pred_Test),4))
        ET_test_accuracy[i,k] = ET2

        ET3 = float(round(End_time - Start_time,6))
        ET_time[i,k] = ET3
    
        j+= 0.05
        k+= 1
In [352]:
# plot the results

x = ET_size[0,]

fig, ax = plt.subplots(ncols=3, figsize=(15,4))

for i in range(3):
    ax[i].plot(x,ET_train_accuracy[i,],label='training accuracy', color = 'blue', marker='*',markersize=8)
    ax[i].plot(x,ET_test_accuracy[i,],label='testing accuracy', color = 'red', marker='*',markersize=8)
    ax[i].grid()
    ax[i].set_xlabel('testing size')
    ax[i].set_ylabel('Accuracy')
    ax[i].set_title('Extra Trees_random state_%i' % i) 
    ax[i].legend()
fig.tight_layout(pad = 0.5)
plt.show()

The training accuracy obtained with extra trees is always 100%. The best testing sizes for the three random states are 18%, 10% and 5% respectively. A testing size of 10% is good enough for all three random states.

In [353]:
# plot the results

x = ET_size[0,]

fig, ax = plt.subplots(ncols=3, figsize=(15,4))

for i in range(3):
    ax[i].plot(x,ET_time[i,],label='running time', color = 'magenta', marker='*',markersize=8)
    ax[i].grid()
    ax[i].set_xlabel('testing size')
    ax[i].set_ylabel('running time')
    ax[i].set_title('Extra Trees_random state_%i' % i) 
    ax[i].legend()
fig.tight_layout(pad = 0.5)
plt.show()

The running time of the algorithm in random state 0 is greater than in the other random states. A 10% testing size makes the algorithm run faster than most other testing sizes in random states 0 and 1.

Gradient Boost Classifier

In [355]:
from sklearn.ensemble import GradientBoostingClassifier 
In [356]:
# Build a learning curve for the testing size.

GB_train_accuracy = np.zeros([3,19])
GB_test_accuracy = np.zeros([3,19])
GB_time = np.zeros([3,19])
GB_size = np.zeros([3,19])

for i in range (3):
    j = 0.05
    k = 0
    while j <=1:
        X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=j,random_state=i)
        
        Start_time = time.time() #Saving current time
        
        GBoost = GradientBoostingClassifier()
        GBoost.fit(X_train,y_train)

        y_pred_Train = GBoost.predict(X_train) #Predictions
        y_pred_Test = GBoost.predict(X_test) #Predictions
        
        End_time = time.time() #Saving current time
        
        GB_size[i,k] = j
    
        GB1 = float(round(metrics.accuracy_score(y_train,y_pred_Train),4))
        GB_train_accuracy[i,k] = GB1

        GB2 = float(round(metrics.accuracy_score(y_test, y_pred_Test),4))
        GB_test_accuracy[i,k] = GB2

        GB3 = float(round(End_time - Start_time,6))
        GB_time[i,k] = GB3
    
        j+= 0.05
        k+= 1
In [357]:
# plot the results

x = GB_size[0,]

fig, ax = plt.subplots(ncols=3, figsize=(15,4))

for i in range(3):
    ax[i].plot(x,GB_train_accuracy[i,],label='training accuracy', color = 'blue', marker='*',markersize=8)
    ax[i].plot(x,GB_test_accuracy[i,],label='testing accuracy', color = 'red', marker='*',markersize=8)
    ax[i].grid()
    ax[i].set_xlabel('testing size')
    ax[i].set_ylabel('Accuracy')
    ax[i].set_title('Gradient Boost_random state_%i' % i) 
    ax[i].legend()
fig.tight_layout(pad = 0.5)
plt.show()

The training accuracy is 100% most of the time in all three random states. The best testing sizes for the three random states are 30%, 10% and 10% respectively. A 10% testing size is good enough in all three random states.

In [358]:
# plot the results

x = GB_size[0,]

fig, ax = plt.subplots(ncols=3, figsize=(15,4))

for i in range(3):
    ax[i].plot(x,GB_time[i,],label='running time', color = 'magenta', marker='*',markersize=8)
    ax[i].grid()
    ax[i].set_xlabel('testing size')
    ax[i].set_ylabel('running time')
    ax[i].set_title('Gradient Boost_random state_%i' % i) 
    ax[i].legend()
fig.tight_layout(pad = 0.5)
plt.show()

A testing size of 10% makes the algorithm run faster than most of the other testing sizes in random state 0. However, this does not hold in the other random states.

Support Vector Classifier

In [359]:
from sklearn.svm import SVC

When it comes to the Support Vector Classifier, the important parameter is kernel.

kernel: {‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’, ‘precomputed’}, default=’rbf’

Specifies the kernel type to be used in the algorithm. It must be one of ‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’, ‘precomputed’ or a callable. If none is given, ‘rbf’ will be used. If a callable is given it is used to pre-compute the kernel matrix from data matrices; that matrix should be an array of shape (n_samples, n_samples). We do not use a precomputed or callable kernel with the heart dataset.
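For completeness, a minimal sketch of what 'precomputed' means (it is not used with the heart dataset): fit receives a Gram matrix of shape (n_samples, n_samples), and predict receives the kernel between the new points and the training points. The data here is synthetic.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X_train = rng.randn(20, 5)
y_train = np.array([0, 1] * 10)  # both classes present

gram_train = X_train @ X_train.T          # linear kernel, shape (20, 20)
clf = SVC(kernel='precomputed')
clf.fit(gram_train, y_train)

X_new = rng.randn(3, 5)
pred = clf.predict(X_new @ X_train.T)     # kernel between new and training points
print(pred.shape)
```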

In [360]:
# Approach 1: SVC with linear
# build a learning curve for testing size


SVCL_train_accuracy = np.zeros([3,19])
SVCL_test_accuracy = np.zeros([3,19])
SVCL_time = np.zeros([3,19])
SVCL_size = np.zeros([3,19])

for i in range (3):
    j = 0.05
    k = 0
    while j <=1:
        X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=j,random_state=i)
        
        Start_time = time.time() #Saving current time
        
        SVector = SVC(kernel = 'linear')
        SVector.fit(X_train,y_train)

        y_pred_Train = SVector.predict(X_train) #Predictions
        y_pred_Test = SVector.predict(X_test) #Predictions
        
        End_time = time.time() #Saving current time
        
        SVCL_size[i,k] = j
    
        SVCL1 = float(round(metrics.accuracy_score(y_train,y_pred_Train),4))
        SVCL_train_accuracy[i,k] = SVCL1

        SVCL2 = float(round(metrics.accuracy_score(y_test, y_pred_Test),4))
        SVCL_test_accuracy[i,k] = SVCL2

        SVCL3 = float(round(End_time - Start_time,6))
        SVCL_time[i,k] = SVCL3
    
        j+= 0.05
        k+= 1

# plot the results:

x = SVCL_size[0,]

fig, ax = plt.subplots(ncols=3, figsize=(15,4))

for i in range(3):
    ax[i].plot(x,SVCL_train_accuracy[i,],label='training accuracy', color = 'blue', marker='*',markersize=8)
    ax[i].plot(x,SVCL_test_accuracy[i,],label='testing accuracy', color = 'red', marker='*',markersize=8)
    ax[i].grid()
    ax[i].set_xlabel('testing size')
    ax[i].set_ylabel('Accuracy')
    ax[i].set_title('SVC Linear_random state_%i' % i) 
    ax[i].legend()
fig.tight_layout(pad = 0.5)
plt.show()

The training and testing accuracy metrics look good. A testing size of 5% is suitable for random states 0 and 2, and 10% for the remaining one.

In [361]:
# Approach 2: SVC with poly
# build a learning curve for testing size


SVCP_train_accuracy = np.zeros([3,19])
SVCP_test_accuracy = np.zeros([3,19])
SVCP_time = np.zeros([3,19])
SVCP_size = np.zeros([3,19])

for i in range (3):
    j = 0.05
    k = 0
    while j <=1:
        X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=j,random_state=i)
        
        Start_time = time.time() #Saving current time
        
        SVector = SVC(kernel = 'poly')
        SVector.fit(X_train,y_train)

        y_pred_Train = SVector.predict(X_train) #Predictions
        y_pred_Test = SVector.predict(X_test) #Predictions
        
        End_time = time.time() #Saving current time
        
        SVCP_size[i,k] = j
    
        SVCP1 = float(round(metrics.accuracy_score(y_train,y_pred_Train),4))
        SVCP_train_accuracy[i,k] = SVCP1

        SVCP2 = float(round(metrics.accuracy_score(y_test, y_pred_Test),4))
        SVCP_test_accuracy[i,k] = SVCP2

        SVCP3 = float(round(End_time - Start_time,6))
        SVCP_time[i,k] = SVCP3
    
        j+= 0.05
        k+= 1

# plot the results:

x = SVCP_size[0,]

fig, ax = plt.subplots(ncols=3, figsize=(15,4))

for i in range(3):
    ax[i].plot(x,SVCP_train_accuracy[i,],label='training accuracy', color = 'blue', marker='*',markersize=8)
    ax[i].plot(x,SVCP_test_accuracy[i,],label='testing accuracy', color = 'red', marker='*',markersize=8)
    ax[i].grid()
    ax[i].set_xlabel('testing size')
    ax[i].set_ylabel('Accuracy')
    ax[i].set_title('SVC Poly_random state_%i' % i) 
    ax[i].legend()
fig.tight_layout(pad = 0.5)
plt.show()

A testing size of 5% is suitable for random states 0 and 2, and 10% for the remaining one.

In [362]:
# Approach 3: SVC with rbf
# build a learning curve for testing size


SVCR_train_accuracy = np.zeros([3,19])
SVCR_test_accuracy = np.zeros([3,19])
SVCR_time = np.zeros([3,19])
SVCR_size = np.zeros([3,19])

for i in range (3):
    j = 0.05
    k = 0
    while j <=1:
        X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=j,random_state=i)
        
        Start_time = time.time() #Saving current time
        
        SVector = SVC(kernel = 'rbf')
        SVector.fit(X_train,y_train)

        y_pred_Train = SVector.predict(X_train) #Predictions
        y_pred_Test = SVector.predict(X_test) #Predictions
        
        End_time = time.time() #Saving current time
        
        SVCR_size[i,k] = j
    
        SVCR1 = float(round(metrics.accuracy_score(y_train,y_pred_Train),4))
        SVCR_train_accuracy[i,k] = SVCR1

        SVCR2 = float(round(metrics.accuracy_score(y_test, y_pred_Test),4))
        SVCR_test_accuracy[i,k] = SVCR2

        SVCR3 = float(round(End_time - Start_time,6))
        SVCR_time[i,k] = SVCR3
    
        j+= 0.05
        k+= 1

# plot the results:

x = SVCR_size[0,]

fig, ax = plt.subplots(ncols=3, figsize=(15,4))

for i in range(3):
    ax[i].plot(x,SVCR_train_accuracy[i,],label='training accuracy', color = 'blue', marker='*',markersize=8)
    ax[i].plot(x,SVCR_test_accuracy[i,],label='testing accuracy', color = 'red', marker='*',markersize=8)
    ax[i].grid()
    ax[i].set_xlabel('testing size')
    ax[i].set_ylabel('Accuracy')
    ax[i].set_title('SVC RBF_random state_%i' % i) 
    ax[i].legend()
fig.tight_layout(pad = 0.5)
plt.show()

A testing size of 5% is suitable for random states 0 and 2, and 10% for the remaining one.

In [363]:
# Approach 4: SVC with sigmoid
# build a learning curve for testing size


SVCS_train_accuracy = np.zeros([3,19])
SVCS_test_accuracy = np.zeros([3,19])
SVCS_time = np.zeros([3,19])
SVCS_size = np.zeros([3,19])

for i in range (3):
    j = 0.05
    k = 0
    while j <=1:
        X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=j,random_state=i)
        
        Start_time = time.time() #Saving current time
        
        SVector = SVC(kernel = 'sigmoid')
        SVector.fit(X_train,y_train)

        y_pred_Train = SVector.predict(X_train) #Predictions
        y_pred_Test = SVector.predict(X_test) #Predictions
        
        End_time = time.time() #Saving current time
        
        SVCS_size[i,k] = j
    
        SVCS1 = float(round(metrics.accuracy_score(y_train,y_pred_Train),4))
        SVCS_train_accuracy[i,k] = SVCS1

        SVCS2 = float(round(metrics.accuracy_score(y_test, y_pred_Test),4))
        SVCS_test_accuracy[i,k] = SVCS2

        SVCS3 = float(round(End_time - Start_time,6))
        SVCS_time[i,k] = SVCS3
    
        j+= 0.05
        k+= 1

# plot the results:

x = SVCS_size[0,]

fig, ax = plt.subplots(ncols=3, figsize=(15,4))

for i in range(3):
    ax[i].plot(x,SVCS_train_accuracy[i,],label='training accuracy', color = 'blue', marker='*',markersize=8)
    ax[i].plot(x,SVCS_test_accuracy[i,],label='testing accuracy', color = 'red', marker='*',markersize=8)
    ax[i].grid()
    ax[i].set_xlabel('testing size')
    ax[i].set_ylabel('Accuracy')
    ax[i].set_title('SVC Sigmoid_random state_%i' % i) 
    ax[i].legend()
fig.tight_layout(pad = 0.5)
plt.show()

A testing size of 25% is suitable for random state 0, while the figures for random states 1 and 2 are 30% and 40% respectively.

In [365]:
# compared the obtained results

x = SVCS_size[0,]

fig, ax = plt.subplots(ncols=3, figsize=(15,4))

for i in range(3):
    ax[i].plot(x,SVCL_train_accuracy[i,],label='linear', color = 'blue', marker='*',markersize=8)
    ax[i].plot(x,SVCP_train_accuracy[i,],label='poly', color = 'red', marker='*',markersize=8)
    ax[i].plot(x,SVCR_train_accuracy[i,],label='rbf', color = 'green', marker='*',markersize=8)
    ax[i].plot(x,SVCS_train_accuracy[i,],label='sigmoid', color = 'yellow', marker='*',markersize=8)
    ax[i].grid()
    ax[i].set_xlabel('testing size')
    ax[i].set_ylabel('Training Accuracy')
    ax[i].set_title('SVC with different kernels_random state_%i' % i) 
    ax[i].legend()
fig.tight_layout(pad = 0.5)
plt.show()

It is obvious that the 'linear' kernel gives better results than the other options.

In [366]:
# compared the obtained results

x = SVCS_size[0,]

fig, ax = plt.subplots(ncols=3, figsize=(15,4))

for i in range(3):
    ax[i].plot(x,SVCL_time[i,],label='linear', color = 'blue', marker='*',markersize=8)
    ax[i].plot(x,SVCP_time[i,],label='poly', color = 'red', marker='*',markersize=8)
    ax[i].plot(x,SVCR_time[i,],label='rbf', color = 'green', marker='*',markersize=8)
    ax[i].plot(x,SVCS_time[i,],label='sigmoid', color = 'yellow', marker='*',markersize=8)
    ax[i].grid()
    ax[i].set_xlabel('testing size')
    ax[i].set_ylabel('Running time')
    ax[i].set_title('SVC with different kernels_random state_%i' % i) 
    ax[i].legend()
fig.tight_layout(pad = 0.5)
plt.show()

However, the 'linear' kernel slows the algorithm down.

All things considered, SVC performs best here with the 'linear' kernel. If running time is a concern, the 'poly' kernel is a reasonable alternative.

Conclusion

For the classification problem with the heart dataset, the following algorithms are considered: Logistic Regression, Naïve Bayes, Classification Tree, Random Forest, K Nearest Neighbors, Extra Tree Classifier, Gradient Boost Classifier, Support Vector Classifier.

Overall, all the algorithms perform well on the heart dataset (high accuracy and short running times).

Finding the best testing size for each algorithm is not easy, because the most suitable testing size varies across runs (different random states). I make some recommendations in my work.

Here are some findings for individual algorithms on the heart dataset:

  • Logistic Regression: performs best when using the 'liblinear' solver after scaling the dataset.

  • Naïve Bayes: Gaussian and Bernoulli perform better than Complement.

  • Classification Tree: the 'entropy' criterion and max_depth = 8 make the best combination.

  • Random Forest: it is hard to compare the methods; some important parameters might not have been considered.

  • KNN: n_neighbors = 23 gives one of the best results.

  • Support Vector Classifier: the 'linear' kernel is the best option.
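
The findings above can be collected as one configured (unfitted) estimator per algorithm. A sketch, using sklearn defaults for anything the text does not pin down:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# one best option per algorithm, following the findings above
best_models = {
    'Logistic Regression': LogisticRegression(solver='liblinear'),
    'Naive Bayes': GaussianNB(),
    'Classification Tree': DecisionTreeClassifier(criterion='entropy', max_depth=8),
    'KNN': KNeighborsClassifier(n_neighbors=23),
    'SVC': SVC(kernel='linear'),
}
for name, model in best_models.items():
    print(name, model)
```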

We can divide the algorithms in two groups:

  • Group 1: Logistic Regression, Naïve Bayes, K Nearest Neighbors.

  • Group 2: Classification Tree, Random Forest, Extra Tree Classifier, Gradient Boost Classifier, Support Vector Classifier.

The second group includes the more complex algorithms. Now we will compare the running times of the algorithms in each group. For each algorithm, we choose only its best option.

Group 1

In [393]:
# plot the training accuracy metrics

a = np.array([[1,1],
             [1,1],
             [1,1]])

KNNd_train_accuracy = np.append(KNNc_train_accuracy,a,axis=1) # pad: the KNN result matrix has shape (3,17), not (3,19)

x = SVCS_size[0,]

fig, ax = plt.subplots(ncols=3, figsize=(15,4))

for i in range(3):
    ax[i].plot(x,liblinearsc_train_accuracy[i,],label='Logistic Regression', color = 'blue', marker='*',markersize=8)
    ax[i].plot(x,NB_train_accuracy[i,],label='Naive Bayes', color = 'red', marker='*',markersize=8)
    ax[i].plot(x,KNNd_train_accuracy[i,],label='K nearest neighbors', color = 'cyan', marker='*',markersize=8)
    ax[i].grid()
    ax[i].set_xlabel('testing size')
    ax[i].set_ylabel('Accuracy')
    ax[i].set_title('Different Algorithms_random state_%i' % i) 
    ax[i].legend()
fig.tight_layout(pad = 0.5)
plt.show()
In [394]:
# plot the running time

b = np.array([[0,0],
             [0,0],
             [0,0]])
KNNd_time = np.append(KNNc_time,b,axis=1) # pad: the KNN result matrix has shape (3,17), not (3,19)

x = SVCS_size[0,]

fig, ax = plt.subplots(ncols=3, figsize=(15,4))

for i in range(3):
    ax[i].plot(x,liblinearsc_time[i,],label='Logistic Regression', color = 'blue', marker='*',markersize=8)
    ax[i].plot(x,NB_time[i,],label='Naive Bayes', color = 'red', marker='*',markersize=8)
    ax[i].plot(x,KNNd_time[i,],label='K nearest neighbors', color = 'cyan', marker='*',markersize=8)
    ax[i].grid()
    ax[i].set_xlabel('testing size')
    ax[i].set_ylabel('Running time')
    ax[i].set_title('Different Algorithms_random state_%i' % i) 
    ax[i].legend()
fig.tight_layout(pad = 0.5)
plt.show()

So, Logistic Regression is the best algorithm in Group 1.

Group 2

In [395]:
# plot the training accuracy metrics

x = SVCS_size[0,]

fig, ax = plt.subplots(ncols=3, figsize=(15,4))

for i in range(3):
    ax[i].plot(x,DTEY_train_accuracy[i,],label='Classification Tree', color = 'green', marker='*',markersize=8)
    ax[i].plot(x,RFGS_train_accuracy[i,],label='Random Forest', color = 'navy', marker='*',markersize=8)
    ax[i].plot(x,ET_train_accuracy[i,],label='Extra Trees', color = 'yellow', marker='*',markersize=8)
    ax[i].plot(x,GB_train_accuracy[i,],label='Gradient Boost', color = 'black', marker='*',markersize=8)
    ax[i].plot(x,SVCL_train_accuracy[i,],label='Support Vector Classifier', color = 'magenta', marker='*',markersize=8)
    ax[i].grid()
    ax[i].set_xlabel('testing size')
    ax[i].set_ylabel('Accuracy')
    ax[i].set_title('Different Algorithms_random state_%i' % i) 
    ax[i].legend()
fig.tight_layout(pad = 0.5)
plt.show()
In [396]:
# plot the running time

x = SVCS_size[0,]

fig, ax = plt.subplots(ncols=3, figsize=(15,4))

for i in range(3):
    ax[i].plot(x,DTEY_time[i,],label='Classification Tree', color = 'green', marker='*',markersize=8)
    ax[i].plot(x,RFGS_time[i,],label='Random Forest', color = 'navy', marker='*',markersize=8)
    ax[i].plot(x,ET_time[i,],label='Extra Trees', color = 'yellow', marker='*',markersize=8)
    ax[i].plot(x,GB_time[i,],label='Gradient Boost', color = 'black', marker='*',markersize=8)
    ax[i].plot(x,SVCL_time[i,],label='Support Vector Classifier', color = 'magenta', marker='*',markersize=8)
    ax[i].grid()
    ax[i].set_xlabel('testing size')
    ax[i].set_ylabel('Running time')
    ax[i].set_title('Different Algorithms_random state_%i' % i) 
    ax[i].legend()
fig.tight_layout(pad = 0.5)
plt.show()

It is obvious that the more complex algorithms (group 2) need more time to run than those in group 1. For the heart dataset, it seems we do not need complex algorithms; the ones in group 1 already give good results. In group 2, the classification tree is the fastest algorithm.

In conclusion, recommending one best algorithm from each group, I would mention Logistic Regression and Classification Tree.
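As a final check, the two recommended models can be fitted side by side. A minimal sketch, using a synthetic stand-in for X and y since the heart dataset is not reloaded here:

```python
import time
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics

# synthetic stand-in for the heart dataset (303 datapoints, 13 input features)
X, y = make_classification(n_samples=303, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=0)

results = {}
for name, model in [('Logistic Regression', LogisticRegression(solver='liblinear')),
                    ('Classification Tree', DecisionTreeClassifier(criterion='entropy', max_depth=8))]:
    start = time.time()
    model.fit(X_train, y_train)
    acc = metrics.accuracy_score(y_test, model.predict(X_test))
    results[name] = (round(acc, 4), round(time.time() - start, 6))
print(results)
```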

In [ ]: